1 Loads & Install Packages

# installation of packages in case the user has not installed yet
if (!require("nnet")) install.packages("nnet")
## Caricamento del pacchetto richiesto: nnet
if (!require("MASS")) install.packages("MASS")
## Caricamento del pacchetto richiesto: MASS
if (!require("e1071")) install.packages("e1071")
## Caricamento del pacchetto richiesto: e1071
if (!require("class")) install.packages("class")
## Caricamento del pacchetto richiesto: class
if (!require("leaps")) install.packages("leaps")
## Caricamento del pacchetto richiesto: leaps
if (!require("glmnet")) install.packages("glmnet")
## Caricamento del pacchetto richiesto: glmnet
## Caricamento del pacchetto richiesto: Matrix
## Loaded glmnet 4.1-8
if (!require("car")) install.packages("car")
## Caricamento del pacchetto richiesto: car
## Caricamento del pacchetto richiesto: carData
if (!require("caTools")) install.packages("caTools")
## Caricamento del pacchetto richiesto: caTools
if (!require("mgcv")) install.packages("mgcv")
## Caricamento del pacchetto richiesto: mgcv
## Caricamento del pacchetto richiesto: nlme
## This is mgcv 1.9-0. For overview type 'help("mgcv-package")'.
## 
## Caricamento pacchetto: 'mgcv'
## Il seguente oggetto è mascherato da 'package:nnet':
## 
##     multinom
if (!require("pROC")) install.packages("pRoc")
## Caricamento del pacchetto richiesto: pROC
## Type 'citation("pROC")' for a citation.
## 
## Caricamento pacchetto: 'pROC'
## I seguenti oggetti sono mascherati da 'package:stats':
## 
##     cov, smooth, var
if (!require("summarytools")) install.packages("summarytools")
## Caricamento del pacchetto richiesto: summarytools
if (!require("dplyr")) install.packages("dplyr")
## Caricamento del pacchetto richiesto: dplyr
## 
## Caricamento pacchetto: 'dplyr'
## Il seguente oggetto è mascherato da 'package:nlme':
## 
##     collapse
## Il seguente oggetto è mascherato da 'package:car':
## 
##     recode
## Il seguente oggetto è mascherato da 'package:MASS':
## 
##     select
## I seguenti oggetti sono mascherati da 'package:stats':
## 
##     filter, lag
## I seguenti oggetti sono mascherati da 'package:base':
## 
##     intersect, setdiff, setequal, union
if (!require("ggplot2")) install.packages("ggplot2")
## Caricamento del pacchetto richiesto: ggplot2
if (!require("tidyverse")) install.packages("tidyverse")
## Caricamento del pacchetto richiesto: tidyverse
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## âś” forcats   1.0.0     âś” stringr   1.5.1
## âś” lubridate 1.9.3     âś” tibble    3.2.1
## âś” purrr     1.0.2     âś” tidyr     1.3.0
## âś” readr     2.1.5     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## âś– dplyr::collapse() masks nlme::collapse()
## âś– tidyr::expand()   masks Matrix::expand()
## âś– dplyr::filter()   masks stats::filter()
## âś– dplyr::lag()      masks stats::lag()
## âś– tidyr::pack()     masks Matrix::pack()
## âś– dplyr::recode()   masks car::recode()
## âś– dplyr::select()   masks MASS::select()
## âś– purrr::some()     masks car::some()
## âś– tidyr::unpack()   masks Matrix::unpack()
## âś– tibble::view()    masks summarytools::view()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
if (!require("lubridate")) install.packages("lubridate")
if (!require("mapview")) install.packages("mapview")
## Caricamento del pacchetto richiesto: mapview
if (!require("leafpop")) install.packages("leafpop")
## Caricamento del pacchetto richiesto: leafpop
if (!require("sf")) install.packages("sf")
## Caricamento del pacchetto richiesto: sf
## Linking to GEOS 3.11.2, GDAL 3.7.2, PROJ 9.3.0; sf_use_s2() is TRUE
if (!require("geojsonio")) install.packages("geojsonio")
## Caricamento del pacchetto richiesto: geojsonio
## Registered S3 method overwritten by 'geojsonsf':
##   method        from   
##   print.geojson geojson
## 
## Caricamento pacchetto: 'geojsonio'
## 
## Il seguente oggetto è mascherato da 'package:base':
## 
##     pretty
if (!require("leaflet")) install.packages("leaflet")
## Caricamento del pacchetto richiesto: leaflet
if (!require("broom")) install.packages("broom")
## Caricamento del pacchetto richiesto: broom
if (!require("plotly")) install.packages("plotly")
## Caricamento del pacchetto richiesto: plotly
## 
## Caricamento pacchetto: 'plotly'
## 
## Il seguente oggetto è mascherato da 'package:ggplot2':
## 
##     last_plot
## 
## Il seguente oggetto è mascherato da 'package:MASS':
## 
##     select
## 
## Il seguente oggetto è mascherato da 'package:stats':
## 
##     filter
## 
## Il seguente oggetto è mascherato da 'package:graphics':
## 
##     layout
if (!require("gridExtra")) install.packages("gridExtra")
## Caricamento del pacchetto richiesto: gridExtra
## 
## Caricamento pacchetto: 'gridExtra'
## 
## Il seguente oggetto è mascherato da 'package:dplyr':
## 
##     combine
# laoding of packages
library(nnet)
library(MASS)
library(e1071)
library(class)
library(leaps)
library(glmnet)
library(car)
library(caTools)
library(mgcv)
library(pROC)

library(summarytools)
library(dplyr)
library(ggplot2)
library(tidyverse)
library(lubridate)
library(mapview)
library(sf)
library(geojsonio)
library(leaflet) 
library(broom)
library(plotly)
library(gridExtra)
library(leafpop)

2 Dataset description

The Fire Incident Dispatch Data file contains data that is generated by the Starfire Computer Aided Dispatch System. The data spans from the time the incident is created in the system to the time the incident is closed in the system. It covers information about the incident as it relates to the assignment of resources and the Fire Department’s response to the emergency. To protect personal identifying information in accordance with the Health Insurance Portability and Accountability Act (HIPAA), specific locations of incidents are not included and have been aggregated to a higher level of detail.

In this analysis we have restricted the analysis only on the last 50.000 observations from 5th of September to 30th of the same month.

  1. STARFIRE_INCIDENT_ID: An incident identifier comprising the 5 character julian date, 4 character alarm box number, 2 character number of incidents at the box so far for the day, 1 character borough code , 4 character sequence number.

  2. INCIDENT_DATETIME: The date and time of the incident.

  3. ALARM_BOX_BOROUGH: The borough of the alarm box.

  4. ALARM_BOX_LOCATION: The location of the alarm box.

  5. ALARM_BOX: The alarm box number.

  6. INCIDENT_BOROUGH: The borough of the incident.

  7. ZIPCODE: The zip code of the incident.

  8. POLICEPRECINCT: The police precinct of the incident.

  9. CITYCOUNCILDISTRICT: The city council district.

  10. COMMUNITYDISTRICT: The community district.

  11. COMMUNITYSCHOOLDISTRICT: The community school district.

  12. CONGRESSIONALDISTRICT: The congressional district.

  13. ALARM_SOURCE_DESCRIPTION_TX: The description of the alarm source.

  14. ALARM_LEVEL_INDEX_DESCRIPTION: The alarm level index.

  15. HIGHEST_ALARM_LEVEL: The highest alarm level.

  16. INCIDENT_CLASSIFICATION: The incident classification.

  17. INCIDENT_CLASSIFICATION_GROUP: The incident classification roll up group.

  18. FIRST_ASSIGNMENT_DATETIME: The date and time of the first unit assignment.

  19. FIRST_ACTIVATION_DATETIME: The date and time of the first unit acknowledgement of the assignment.

  20. FIRST_ON_SCENE_DATETIME: The date and time of the first unit at the scene of the incident.

  21. INCIDENT_CLOSE_DATETIME: The date and time that the incident was closed in the dispatch system.

  22. VALID_DISPATCH_RSPNS_TIME_INDC: Indicates that the components comprising the generation of the DISPATCH_RESPONSE_SECONDS_QY are valid.

  23. DISPATCH_RESPONSE_SECONDS_QY: The elapsed time in seconds between the INCIDENT_DATETIME and the FIRST_ASSIGNMENT_DATETIME.

  24. VALID_INCIDENT_RSPNS_TIME_INDC: Indicates that the components comprising the generation of the INCIDENT_RESPONSE_SECONDS_QY are valid.

  25. INCIDENT_RESPONSE_SECONDS_QY: The elapsed time in seconds between the INCIDENT_DATETIME and the FIRST_ON_SCENE_DATETIME.

  26. INCIDENT_TRAVEL_TM_SECONDS_QY: The elapsed time in seconds between the FIRST_ASSIGNMENT_DATETIME and the FIRST_ON_SCENE_DATETIME.

  27. ENGINES_ASSIGNED_QUANTITY: The number of engine units assigned to the incident.

  28. LADDERS_ASSIGNED_QUANTITY: The number of ladder units assigned to the incident.

  29. OTHER_UNITS_ASSIGNED_QUANTITY: The number of units that are not engines or ladders that were assigned to the incident.

2.1 Analysis Description

We will try create two different analysis.

  1. The aim to predict the INCIDENT_RESPONSE_SECONDS_QY which is the time difference between the FIRST_ON_SCENE_DATETIME and INCIDENT_DATETIME.
  2. The focus is to predict the EMERGENCY_TIME which is the time difference between the FIRST_ON_SCENE_DATETIME and INCIDENT_CLOSE_DATETIME.

Both analysis use a linear regression model, however we will see that the assumption for applying the linear regression will be not meet, thus we will simplify our project moving into classification, dividing in two or more ranges the two responses. In addition of this we will perform data exploration and cleaning, studying the presence or not of pattern of NA values and invalid values.

Next we will try to answer the following questions:

  • Is more likely to have a quick response or emergency time living in a particular borough / city council district?
  • Is more likely to have a quick response or emergency time in a particular time of the day?
  • Is more likely to have a quick response or emergency time on the week end?
  • Is more likely to have a quick response or emergency time respect to the class emergency?
  • Is more likely to have a quick response or emergency time respect to the assigned units?

3 Data Exlporation and Cleaning

The first step is always to read the dataset, plot the first 5 observations and see its whole dimension.

fire_data <- read.csv("datasets/Fire_Incident_Dispatch_Data_last_50k.csv")

head(fire_data)
##    STARFIRE_INCIDENT_ID      INCIDENT_DATETIME ALARM_BOX_BOROUGH
## 1 230905-B1937-001-0567 09/05/2023 02:19:04 PM          BROOKLYN
## 2 230905-B3923-002-0568 09/05/2023 02:19:36 PM          BROOKLYN
## 3 230905-X8897-003-0480 09/05/2023 02:19:43 PM             BRONX
## 4 230905-X3466-001-0481 09/05/2023 02:21:00 PM             BRONX
## 5 230905-B2448-001-0570 09/05/2023 02:21:26 PM          BROOKLYN
## 6 230905-B2448-002-0571 09/05/2023 02:22:35 PM          BROOKLYN
##   ALARM_BOX_NUMBER                    ALARM_BOX_LOCATION INCIDENT_BOROUGH
## 1             1937                AUTUMN AVE & FULTON ST         BROOKLYN
## 2             3923          N/S EASTERN PWAY & UTICA AVE         BROOKLYN
## 3             8897 CROSS BX EXPY- DEEGAN EX TO JEROME AV            BRONX
## 4             3466                  ADEE AVE & BX PARK E            BRONX
## 5             2448             GLENWOOD RD & BEDFORD AVE         BROOKLYN
## 6             2448             GLENWOOD RD & BEDFORD AVE         BROOKLYN
##   ZIPCODE POLICEPRECINCT CITYCOUNCILDISTRICT COMMUNITYDISTRICT
## 1   11208             75                  37               305
## 2   11213             71                  35               309
## 3      NA             NA                  NA                NA
## 4   10467             49                  12               211
## 5   11210             70                  45               314
## 6   11210             70                  45               314
##   COMMUNITYSCHOOLDISTRICT CONGRESSIONALDISTRICT ALARM_SOURCE_DESCRIPTION_TX
## 1                      19                     7                         EMS
## 2                      17                     9                     CLASS-3
## 3                      NA                    NA                     EMS-911
## 4                      11                    15                         EMS
## 5                      22                     9                         EMS
## 6                      22                     9                         EMS
##   ALARM_LEVEL_INDEX_DESCRIPTION HIGHEST_ALARM_LEVEL
## 1                 Initial Alarm         First Alarm
## 2                 Initial Alarm         First Alarm
## 3                 Initial Alarm         First Alarm
## 4                DEFAULT RECORD         First Alarm
## 5                DEFAULT RECORD         First Alarm
## 6                DEFAULT RECORD         First Alarm
##                  INCIDENT_CLASSIFICATION INCIDENT_CLASSIFICATION_GROUP
## 1 Medical - No PT Contact EMS is Onscene           Medical Emergencies
## 2                          Hospital Fire              Structural Fires
## 3               Vehicle Accident - Other        NonMedical Emergencies
## 4               Medical - EMS Link 10-91           Medical Emergencies
## 5               Medical - EMS Link 10-91           Medical Emergencies
## 6               Medical - EMS Link 10-91           Medical Emergencies
##   DISPATCH_RESPONSE_SECONDS_QY FIRST_ASSIGNMENT_DATETIME
## 1                            7    09/05/2023 02:19:12 PM
## 2                           95    09/05/2023 02:21:11 PM
## 3                           41    09/05/2023 02:20:25 PM
## 4                          298    09/05/2023 02:25:59 PM
## 5                           25    09/05/2023 02:21:52 PM
## 6                          350    09/05/2023 02:28:25 PM
##   FIRST_ACTIVATION_DATETIME FIRST_ON_SCENE_DATETIME INCIDENT_CLOSE_DATETIME
## 1    09/05/2023 02:19:26 PM  09/05/2023 02:25:23 PM  09/05/2023 03:03:15 PM
## 2    09/05/2023 02:21:33 PM  09/05/2023 02:23:21 PM  09/05/2023 02:34:18 PM
## 3    09/05/2023 02:20:35 PM  09/05/2023 02:26:22 PM  09/05/2023 04:13:32 PM
## 4    09/05/2023 02:26:04 PM                          09/05/2023 02:34:23 PM
## 5    09/05/2023 02:22:08 PM                          09/05/2023 02:28:07 PM
## 6                                                    09/05/2023 02:29:09 PM
##   VALID_DISPATCH_RSPNS_TIME_INDC VALID_INCIDENT_RSPNS_TIME_INDC
## 1                              N                              Y
## 2                              N                              Y
## 3                              N                              Y
## 4                              N                              N
## 5                              N                              N
## 6                              N                              N
##   INCIDENT_RESPONSE_SECONDS_QY INCIDENT_TRAVEL_TM_SECONDS_QY
## 1                          378                           371
## 2                          224                           129
## 3                          398                           357
## 4                           NA                            NA
## 5                           NA                            NA
## 6                           NA                            NA
##   ENGINES_ASSIGNED_QUANTITY LADDERS_ASSIGNED_QUANTITY
## 1                         1                         0
## 2                         3                         2
## 3                         2                         3
## 4                         1                         0
## 5                         1                         0
## 6                         1                         0
##   OTHER_UNITS_ASSIGNED_QUANTITY
## 1                             0
## 2                             1
## 3                             1
## 4                             0
## 5                             0
## 6                             0
dim(fire_data)
## [1] 50000    29

Use dfSummary from summarytool in order to have a complete and clear sumamry of the dataset.

print(dfSummary(fire_data, 
                plain.ascii  = FALSE, 
                style        = "multiline", 
                headings     = FALSE,
                graph.magnif = 0.8, 
                valid.col    = FALSE),
                method = 'render')
No Variable Stats / Values Freqs (% of Valid) Graph Missing
1 STARFIRE_INCIDENT_ID [character]
1. 230905-B0042-001-1051
2. 230905-B0053-001-0760
3. 230905-B0053-002-0910
4. 230905-B0081-001-1137
5. 230905-B0106-002-0632
6. 230905-B0132-001-0713
7. 230905-B0147-001-0967
8. 230905-B0160-001-1125
9. 230905-B0163-001-1026
10. 230905-B0165-001-0778
[ 49990 others ]
1(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
49990(100.0%)
0 (0.0%)
2 INCIDENT_DATETIME [character]
1. 09/07/2023 03:53:19 PM
2. 09/11/2023 09:44:33 AM
3. 09/13/2023 12:09:35 AM
4. 09/29/2023 09:44:26 AM
5. 09/05/2023 03:30:51 PM
6. 09/05/2023 03:37:48 PM
7. 09/05/2023 03:53:11 PM
8. 09/05/2023 04:01:29 PM
9. 09/05/2023 04:32:57 PM
10. 09/05/2023 04:59:57 PM
[ 49364 others ]
3(0.0%)
3(0.0%)
3(0.0%)
3(0.0%)
2(0.0%)
2(0.0%)
2(0.0%)
2(0.0%)
2(0.0%)
2(0.0%)
49976(100.0%)
0 (0.0%)
3 ALARM_BOX_BOROUGH [character]
1. BRONX
2. BROOKLYN
3. MANHATTAN
4. QUEENS
5. RICHMOND / STATEN ISLAND
10973(21.9%)
13980(28.0%)
12890(25.8%)
9879(19.8%)
2278(4.6%)
0 (0.0%)
4 ALARM_BOX_NUMBER [integer]
Mean (sd) : 2930.3 (2446.5)
min ≤ med ≤ max:
10 ≤ 2275 ≤ 9933
IQR (CV) : 2772 (0.8)
7411 distinct values 0 (0.0%)
5 ALARM_BOX_LOCATION [character]
1. 8 AVE & W 155 ST
2. 10 RICHMAN PLZ/SEDGWICK A
3. AMSTERDAM AVE & LA SALLE
4. 3 AVE & E 143 ST
5. WASHINGTON AVE & E 170 ST
6. FDR DR & E 6 ST
7. CONCOURSE VILLAGE E & E 1
8. PARK AVE & E 158 ST
9. UNION TPK & WINCHESTER BL
10. 8 AVE & W 33 ST
[ 12203 others ]
85(0.2%)
75(0.1%)
50(0.1%)
48(0.1%)
48(0.1%)
45(0.1%)
44(0.1%)
40(0.1%)
40(0.1%)
39(0.1%)
49486(99.0%)
0 (0.0%)
6 INCIDENT_BOROUGH [character]
1. BRONX
2. BROOKLYN
3. MANHATTAN
4. QUEENS
5. RICHMOND / STATEN ISLAND
10973(21.9%)
13980(28.0%)
12890(25.8%)
9879(19.8%)
2278(4.6%)
0 (0.0%)
7 ZIPCODE [integer]
Mean (sd) : 10737.9 (551.8)
min ≤ med ≤ max:
10000 ≤ 10472 ≤ 11697
IQR (CV) : 1098 (0.1)
217 distinct values 3181 (6.4%)
8 POLICEPRECINCT [integer]
Mean (sd) : 62.3 (34.8)
min ≤ med ≤ max:
1 ≤ 61 ≤ 123
IQR (CV) : 56 (0.6)
77 distinct values 3180 (6.4%)
9 CITYCOUNCILDISTRICT [integer]
Mean (sd) : 23.1 (15.1)
min ≤ med ≤ max:
1 ≤ 21 ≤ 51
IQR (CV) : 27 (0.7)
51 distinct values 3180 (6.4%)
10 COMMUNITYDISTRICT [integer]
Mean (sd) : 262.9 (119.4)
min ≤ med ≤ max:
101 ≤ 302 ≤ 595
IQR (CV) : 206 (0.5)
70 distinct values 3180 (6.4%)
11 COMMUNITYSCHOOLDISTRICT [integer]
Mean (sd) : 14.8 (9.7)
min ≤ med ≤ max:
1 ≤ 13 ≤ 32
IQR (CV) : 18 (0.7)
32 distinct values 3182 (6.4%)
12 CONGRESSIONALDISTRICT [integer]
Mean (sd) : 10.4 (3.3)
min ≤ med ≤ max:
3 ≤ 11 ≤ 16
IQR (CV) : 5 (0.3)
13 distinct values 3180 (6.4%)
13 ALARM_SOURCE_DESCRIPTION_TX [character]
1. 911
2. 911TEXT
3. BARS
4. CLASS-3
5. EMS
6. EMS-911
7. ERS
8. ERS-NC
9. PHONE
10. SOL
11. VERBAL
302(0.6%)
14(0.0%)
1(0.0%)
5025(10.1%)
17178(34.4%)
10520(21.0%)
777(1.6%)
1(0.0%)
15146(30.3%)
5(0.0%)
1031(2.1%)
0 (0.0%)
14 ALARM_LEVEL_INDEX_DESCRIPTION [character]
1. 10-75 Signal (Request for
2. 10-76 & 10-77 Signal (Not
3. 7-5 (All Hands Alarm)
4. DEFAULT RECORD
5. Initial Alarm
6. Second Alarm
7. Third Alarm
13(0.0%)
3(0.0%)
100(0.2%)
17313(34.6%)
32562(65.1%)
8(0.0%)
1(0.0%)
0 (0.0%)
15 HIGHEST_ALARM_LEVEL [character]
1. All Hands Working
2. First Alarm
3. Second Alarm
4. Third Alarm
100(0.2%)
49891(99.8%)
8(0.0%)
1(0.0%)
0 (0.0%)
16 INCIDENT_CLASSIFICATION [character]
1. Medical - EMS Link 10-91
2. Medical - PD Link 10-91
3. Medical - Breathing / Ill
4. Medical - No PT Contact E
5. Assist Civilian - Non-Med
6. Alarm System - Unnecessar
7. Elevator Emergency - Occu
8. Vehicle Accident - Other
9. Utility Emergency - Gas
10. Odor - Other Than Smoke
[ 57 others ]
9509(19.0%)
5741(11.5%)
5453(10.9%)
5013(10.0%)
4140(8.3%)
2845(5.7%)
1954(3.9%)
1543(3.1%)
1359(2.7%)
1337(2.7%)
11106(22.2%)
0 (0.0%)
17 INCIDENT_CLASSIFICATION_GROUP [character]
1. Medical Emergencies
2. Medical MFAs
3. NonMedical Emergencies
4. NonMedical MFAs
5. NonStructural Fires
6. Structural Fires
26824(53.6%)
208(0.4%)
19072(38.1%)
1680(3.4%)
703(1.4%)
1513(3.0%)
0 (0.0%)
18 DISPATCH_RESPONSE_SECONDS_QY [integer]
Mean (sd) : 40 (133.1)
min ≤ med ≤ max:
2 ≤ 19 ≤ 9023
IQR (CV) : 33 (3.3)
841 distinct values 0 (0.0%)
19 FIRST_ASSIGNMENT_DATETIME [character]
1. 09/06/2023 01:40:49 PM
2. 09/08/2023 02:43:53 PM
3. 09/20/2023 01:09:29 PM
4. 09/27/2023 02:04:41 PM
5. 09/05/2023 02:34:37 PM
6. 09/05/2023 03:38:43 PM
7. 09/05/2023 03:48:54 PM
8. 09/05/2023 03:56:07 PM
9. 09/05/2023 05:01:22 PM
10. 09/05/2023 07:13:08 PM
[ 49499 others ]
3(0.0%)
3(0.0%)
3(0.0%)
3(0.0%)
2(0.0%)
2(0.0%)
2(0.0%)
2(0.0%)
2(0.0%)
2(0.0%)
49976(100.0%)
0 (0.0%)
20 FIRST_ACTIVATION_DATETIME [character]
1. (Empty string)
2. 09/22/2023 02:07:47 PM
3. 09/07/2023 06:59:12 PM
4. 09/10/2023 06:28:10 PM
5. 09/17/2023 07:27:03 PM
6. 09/23/2023 08:01:29 PM
7. 09/25/2023 10:17:28 AM
8. 09/29/2023 08:16:05 AM
9. 09/05/2023 02:47:25 PM
10. 09/05/2023 03:00:12 PM
[ 49196 others ]
139(0.3%)
4(0.0%)
3(0.0%)
3(0.0%)
3(0.0%)
3(0.0%)
3(0.0%)
3(0.0%)
2(0.0%)
2(0.0%)
49835(99.7%)
0 (0.0%)
21 FIRST_ON_SCENE_DATETIME [character]
1. (Empty string)
2. 09/30/2023 04:01:43 PM
3. 09/05/2023 03:17:20 PM
4. 09/05/2023 03:27:35 PM
5. 09/05/2023 04:37:22 PM
6. 09/05/2023 04:38:47 PM
7. 09/05/2023 04:39:30 PM
8. 09/05/2023 05:44:27 PM
9. 09/05/2023 05:55:56 PM
10. 09/05/2023 08:59:49 PM
[ 35543 others ]
14112(28.2%)
3(0.0%)
2(0.0%)
2(0.0%)
2(0.0%)
2(0.0%)
2(0.0%)
2(0.0%)
2(0.0%)
2(0.0%)
35869(71.7%)
0 (0.0%)
22 INCIDENT_CLOSE_DATETIME [character]
1. 09/05/2023 06:13:06 PM
2. 09/10/2023 02:16:37 PM
3. 09/24/2023 04:10:06 PM
4. 09/25/2023 12:20:57 AM
5. 09/27/2023 04:38:25 PM
6. 09/29/2023 10:42:38 AM
7. 09/30/2023 06:06:40 PM
8. 09/05/2023 03:25:13 PM
9. 09/05/2023 04:08:06 PM
10. 09/05/2023 05:08:09 PM
[ 49399 others ]
3(0.0%)
3(0.0%)
3(0.0%)
3(0.0%)
3(0.0%)
3(0.0%)
3(0.0%)
2(0.0%)
2(0.0%)
2(0.0%)
49973(99.9%)
0 (0.0%)
23 VALID_DISPATCH_RSPNS_TIME_INDC [character] 1. N
50000(100.0%)
0 (0.0%)
24 VALID_INCIDENT_RSPNS_TIME_INDC [character]
1. N
2. Y
17036(34.1%)
32964(65.9%)
0 (0.0%)
25 INCIDENT_RESPONSE_SECONDS_QY [integer]
Mean (sd) : 380.7 (233.2)
min ≤ med ≤ max:
18 ≤ 334 ≤ 7130
IQR (CV) : 161 (0.6)
1496 distinct values 14112 (28.2%)
26 INCIDENT_TRAVEL_TM_SECONDS_QY [integer]
Mean (sd) : 340.5 (208.6)
min ≤ med ≤ max:
0 ≤ 301 ≤ 7122
IQR (CV) : 159 (0.6)
1382 distinct values 14112 (28.2%)
27 ENGINES_ASSIGNED_QUANTITY [integer]
Mean (sd) : 1.1 (0.8)
min ≤ med ≤ max:
0 ≤ 1 ≤ 19
IQR (CV) : 0 (0.7)
15 distinct values 62 (0.1%)
28 LADDERS_ASSIGNED_QUANTITY [integer]
Mean (sd) : 0.6 (0.8)
min ≤ med ≤ max:
0 ≤ 0 ≤ 15
IQR (CV) : 1 (1.4)
12 distinct values 62 (0.1%)
29 OTHER_UNITS_ASSIGNED_QUANTITY [integer]
Mean (sd) : 0.3 (0.8)
min ≤ med ≤ max:
0 ≤ 0 ≤ 32
IQR (CV) : 0 (2.8)
23 distinct values 62 (0.1%)

Generated by summarytools 1.0.1 (R version 4.3.2)
2024-01-17

Now we decide to rename all the columns in order to be of smaller length once we plots some charts.

fire_data <- fire_data %>%
  rename(id = STARFIRE_INCIDENT_ID, datetime = INCIDENT_DATETIME, al_borough = ALARM_BOX_BOROUGH,
    al_number = ALARM_BOX_NUMBER,al_location = ALARM_BOX_LOCATION, inc_borough = INCIDENT_BOROUGH,
    zipcode = ZIPCODE, pol_prec = POLICEPRECINCT, city_con_dist = CITYCOUNCILDISTRICT,
    commu_dist = COMMUNITYDISTRICT, commu_sc_dist = COMMUNITYSCHOOLDISTRICT,
    cong_dist = CONGRESSIONALDISTRICT, al_source_desc = ALARM_SOURCE_DESCRIPTION_TX,
    al_index_desc = ALARM_LEVEL_INDEX_DESCRIPTION, highest_al_level = HIGHEST_ALARM_LEVEL,
    inc_class = INCIDENT_CLASSIFICATION, inc_class_group = INCIDENT_CLASSIFICATION_GROUP,
    first_ass_datetime = FIRST_ASSIGNMENT_DATETIME, first_act_datetime = FIRST_ACTIVATION_DATETIME,
    first_onscene_datetime = FIRST_ON_SCENE_DATETIME, inc_close_datetime = INCIDENT_CLOSE_DATETIME, 
                   
    disp_resp_sec_qy = DISPATCH_RESPONSE_SECONDS_QY, disp_resp_sec_indc = VALID_DISPATCH_RSPNS_TIME_INDC,
    inc_resp_sec_qy = INCIDENT_RESPONSE_SECONDS_QY, inc_resp_sec_indc = VALID_INCIDENT_RSPNS_TIME_INDC,
                   
    inc_travel_sec_qy = INCIDENT_TRAVEL_TM_SECONDS_QY, 
                   
    engines_assigned = ENGINES_ASSIGNED_QUANTITY,
    ladders_assigned = LADDERS_ASSIGNED_QUANTITY, others_units_assigned = OTHER_UNITS_ASSIGNED_QUANTITY)

As we can see there is lots of work to do, in fact we identify the following problem:

  1. Many NA values
  2. Lots of predictors are characters and not factor
  3. Many future factorial predictors have values that occurs very small time, this suggest that we can inglobe those in a bigger category
  4. The differences of datatime have huge variation, we can think on scaling the relative values
  5. Possible duplicate columns

We start by converting the non factorial columns into factorial one. Then we perform accurate data exploration and cleaning in each macro columns set.

# set factorial
fire_data$inc_borough <- as.factor(fire_data$inc_borough)
fire_data$al_borough <- as.factor(fire_data$al_borough)
fire_data$al_source_desc <- as.factor(fire_data$al_source_desc)
fire_data$al_index_desc <- as.factor(fire_data$al_index_desc)
fire_data$highest_al_level <- as.factor(fire_data$highest_al_level)
fire_data$cong_dist <- as.factor(fire_data$cong_dist)

fire_data$disp_resp_sec_indc <- as.factor(fire_data$disp_resp_sec_indc)
levels(fire_data$disp_resp_sec_indc)<- c("N", "Y")

fire_data$inc_resp_sec_indc <- as.factor(fire_data$inc_resp_sec_indc)
levels(fire_data$inc_resp_sec_indc)<- c("N", "Y")

fire_data$inc_class_group <- as.factor(fire_data$inc_class_group)
fire_data$inc_class <- as.factor(fire_data$inc_class)

3.1 Process Temporal Data

For the temporal data processing we employ a little of feature engineering. Indeed we decide to add the following predictors:

  1. day_type: a factorial predictor to indicate in the incident day is a week day or not
  2. time_of_day: a factorial predictor that indicates the range of time whenever the incident happens, so Night (if the hour is between 0 and 6), Morning (if the hour is between 6 and 12), Afternoon (if the hour is between 12 and 18), Evening (if the hour is between 18 and 24).
  3. emergency_min_qy: which represents the difference between the inc_close_datetime and the first_onscene_datetime.
  4. working_hour: indicates if the incident between 8AM and 7PM, so the classic working hour.

Moreover since we are dealing with datetime we also check if the differences (inc_resp_sec_qy, inc_travel_sec_qy and disp_resp_sec_qy) are actually corrects, if not we replace them with the correct one.

Now we note that the maximum level of the time differences is very high to be considered as seconds so we decided to scale the two indicators in minutes.

summary(fire_data %>% select(inc_resp_sec_qy, inc_travel_sec_qy, disp_resp_sec_qy))
##  inc_resp_sec_qy  inc_travel_sec_qy disp_resp_sec_qy 
##  Min.   :  18.0   Min.   :   0.0    Min.   :   2.00  
##  1st Qu.: 265.0   1st Qu.: 233.0    1st Qu.:   7.00  
##  Median : 334.0   Median : 301.0    Median :  19.00  
##  Mean   : 380.7   Mean   : 340.5    Mean   :  39.96  
##  3rd Qu.: 426.0   3rd Qu.: 392.0    3rd Qu.:  40.00  
##  Max.   :7130.0   Max.   :7122.0    Max.   :9023.00  
##  NA's   :14112    NA's   :14112
# scaling
fire_data$inc_resp_sec_qy <- fire_data$inc_resp_sec_qy / 60
fire_data$inc_travel_sec_qy <- fire_data$inc_travel_sec_qy / 60
fire_data$disp_resp_sec_qy <- fire_data$disp_resp_sec_qy / 60 

# renaming both quantity and indicator predictors for the two datetime 
fire_data <- fire_data %>% rename(inc_resp_min_qy = inc_resp_sec_qy, inc_travel_min_qy = inc_travel_sec_qy, disp_resp_min_qy = disp_resp_sec_qy, # quantity
                                  inc_resp_min_indc = inc_resp_sec_indc, disp_resp_min_indc = disp_resp_sec_indc) # indicator

Perform the datetime feature engineering that we have discussed before.

# Process datetime column
fire_data$datetime <- mdy_hms(fire_data$datetime)
fire_data$first_ass_datetime <- mdy_hms(fire_data$first_ass_datetime)
fire_data$first_act_datetime <- mdy_hms(fire_data$first_act_datetime)
fire_data$first_onscene_datetime <- mdy_hms(fire_data$first_onscene_datetime)
fire_data$inc_close_datetime <- mdy_hms(fire_data$inc_close_datetime)


# checking if the differences are well computed if not change with the correct one

if (!identical(fire_data$inc_resp_min_qy, as.numeric(difftime(fire_data$first_onscene_datetime, fire_data$datetime, units="mins")))){
  fire_data$inc_resp_min_qy <- as.numeric(difftime(fire_data$first_onscene_datetime, fire_data$datetime, units="mins"))
}

if (!identical(fire_data$inc_travel_min_qy, as.numeric(difftime(fire_data$first_onscene_datetime, fire_data$first_ass_datetime, units="mins")))){
  fire_data$inc_travel_min_qy <- as.numeric(difftime(fire_data$first_onscene_datetime, fire_data$first_ass_datetime, units="mins"))
}

if (!identical(fire_data$disp_resp_min_qy, as.numeric(difftime(fire_data$first_ass_datetime, fire_data$datetime, units="mins")))){
  fire_data$disp_resp_min_qy <- as.numeric(difftime(fire_data$first_ass_datetime, fire_data$datetime, units="mins"))
}

# creating emergency_min_qy which describe the time taken by the firefighter to close the emergency after have been arrived to the location 
fire_data$emergency_min_qy <- as.numeric(difftime(fire_data$inc_close_datetime, fire_data$first_onscene_datetime, units="mins"))

# creating day_type
fire_data$day_type <- as.factor(ifelse(weekdays(fire_data$datetime) %in% c("sabato", "domenica"), "Weekend", "Weekday"))

# creating working_hour
fire_data$working_hour <- as.factor(ifelse(hour(fire_data$datetime) >= 19 | hour(fire_data$datetime) < 8, FALSE, TRUE))
  
# creating time_of_day
fire_data$time_of_day <- cut(
    hour(fire_data$datetime),
    breaks = c(0, 6, 12, 18, 24),
    labels = c("Night", "Morning", "Afternoon", "Evening"),
    include.lowest = TRUE,
    right = TRUE
)

# delete the datetime
fire_data$datetime <- NULL

Check the distribution of time_of_day

table(fire_data$time_of_day)
## 
##     Night   Morning Afternoon   Evening 
##      8521     13270     16499     11710
ggplot(data=fire_data %>% 
          group_by(time_of_day) %>%
          summarise(incident_number = n()), 
        aes(x=time_of_day, y=incident_number)) + 
      geom_bar(stat="identity", position=position_dodge()) +
      geom_text(aes(label=incident_number), vjust=1.6, color="white", position = position_dodge(0.9), size=3.5) +
      labs(title = "Time of the Day - Incident Count", x = "Time of the Day", y = "Incident Count")

From this we can see that the higher number of fire incident is registered from 12 PM to 18 PM, whereas the lower number of fire incident happened from the 00 AM to 06 AM.

Check the distribution of day_type

table(fire_data$day_type)
## 
## Weekday Weekend 
##   36482   13518
day_type_table <- table(fire_data$day_type)
day_type_table[1] <- day_type_table[1] / 5
day_type_table[2] <- day_type_table[2] / 2
day_type_table
## 
## Weekday Weekend 
##  7296.4  6759.0

And in proportion we can see that on average there is an higher number of fire incident on the week day respect to the week end days.

Check the distribution of working_hour

table(fire_data$working_hour)
## 
## FALSE  TRUE 
## 21859 28141
table(fire_data$working_hour) / dim(fire_data)[1]
## 
##   FALSE    TRUE 
## 0.43718 0.56282

It seems like that there are slightly more incident during the working hour.

3.2 Process Spatial Data

Rename the factor levels for the inc_borough and al_borough.

fire_data <- fire_data %>% mutate(inc_borough = recode_factor(
                  inc_borough, "BRONX" = "Bronx", "BROOKLYN" = "Brooklyn", "MANHATTAN" = "Manhattan",
                  "QUEENS" = "Queens", "RICHMOND / STATEN ISLAND" = "Staten Island"),
                  
                  al_borough = recode_factor(
                  al_borough, "BRONX" = "Bronx", "BROOKLYN" = "Brooklyn", "MANHATTAN" = "Manhattan",
                  "QUEENS" = "Queens", "RICHMOND / STATEN ISLAND" = "Staten Island"))

Regarding the Spatial Data we decide to keep the borough and also the congressional district since are the two predictors that have the least number of categories. Further in the section of data visualization we have a look on an interactive map of some relevant information for both aspects.

3.3 Merging Factors

At this point we merge some possible value from factorial predictors to make the space of possible choice smaller.

3.3.1 Highest Alarm Level

Here we merge the following factorial values of highest_al_level: All Hands Working, Second Alarm and Third Alarm into NonFirst Alarm. But before doing so let’s see the actual distribution of values.

table(fire_data$highest_al_level)
## 
## All Hands Working       First Alarm      Second Alarm       Third Alarm 
##               100             49891                 8                 1

As we can see the majority of the observations are of the type First Alarm, whereas the other observation reach 109 observation, thus we concatenate the less category values into a single one called NonFirst Alarm

# highest_al_level
fire_data$highest_alarm_lev_new <- fire_data$highest_al_level
levels(fire_data$highest_alarm_lev_new) <- list(
  "First Alarm" = "First Alarm", 
  "NonFirst Alarm" = c("All Hands Working", "Second Alarm", "Third Alarm")
)

ctable(fire_data$highest_al_level, fire_data$highest_alarm_lev_new, prop = 'n', totals = FALSE, headings = FALSE)
## 
## ------------------- ----------------------- ------------- ----------------
##                       highest_alarm_lev_new   First Alarm   NonFirst Alarm
##    highest_al_level                                                       
##   All Hands Working                                     0              100
##         First Alarm                                 49891                0
##        Second Alarm                                     0                8
##         Third Alarm                                     0                1
## ------------------- ----------------------- ------------- ----------------
fire_data$highest_al_level <- fire_data$highest_alarm_lev_new
fire_data$highest_alarm_lev_new <- NULL

3.3.2 Alarm Index Description

Here we merge the following factorial values of al_index_desc: Second Alarm, Third Alarm, 7-5 (All Hands Alarm), 10-76 & 10-77 Signal (Notification Hi-Rise Fire) and 10-75 Signal (Request for all hands alarm) into Others. But before doing so let’s see the actual distribution of values.

table(fire_data$al_index_desc)
## 
##       10-75 Signal (Request for all hands alarm) 
##                                               13 
## 10-76 & 10-77 Signal (Notification Hi-Rise Fire) 
##                                                3 
##                            7-5 (All Hands Alarm) 
##                                              100 
##                                   DEFAULT RECORD 
##                                            17313 
##                                    Initial Alarm 
##                                            32562 
##                                     Second Alarm 
##                                                8 
##                                      Third Alarm 
##                                                1

As we can see the two major category are DEFAULT RECORD and Initial Alarm, whereas the rest of categories occur very few time respect the main two, thus we decide again to merge them.

# al_index_desc
fire_data$alarm_level_idx_new <- fire_data$al_index_desc
levels(fire_data$alarm_level_idx_new) <- list(
  "DEFAULT RECORD" = "DEFAULT RECORD",
  "Initial Alarm" = "Initial Alarm", 
  "Others" = c("Second Alarm", "Third Alarm", "7-5 (All Hands Alarm)", 
               "10-76 & 10-77 Signal (Notification Hi-Rise Fire)",
               "10-75 Signal (Request for all hands alarm)")
)

ctable(fire_data$al_index_desc, fire_data$alarm_level_idx_new, prop = 'n', totals = FALSE, headings = FALSE)
## 
## -------------------------------------------------- --------------------- ---------------- --------------- --------
##                                                      alarm_level_idx_new   DEFAULT RECORD   Initial Alarm   Others
##                                      al_index_desc                                                                
##         10-75 Signal (Request for all hands alarm)                                      0               0       13
##   10-76 & 10-77 Signal (Notification Hi-Rise Fire)                                      0               0        3
##                              7-5 (All Hands Alarm)                                      0               0      100
##                                     DEFAULT RECORD                                  17313               0        0
##                                      Initial Alarm                                      0           32562        0
##                                       Second Alarm                                      0               0        8
##                                        Third Alarm                                      0               0        1
## -------------------------------------------------- --------------------- ---------------- --------------- --------
fire_data$al_index_desc <- fire_data$alarm_level_idx_new
fire_data$alarm_level_idx_new <- NULL

3.3.3 Alarm Source Description

Here we merge the following factorial values of al_source_desc: 911, 911TEXT, VERBAL, BARS, ERS, ERS-NC and SOL into Others. But before doing so let’s see the actual distribution of values.

table(fire_data$al_source_desc)
## 
##     911 911TEXT    BARS CLASS-3     EMS EMS-911     ERS  ERS-NC   PHONE     SOL 
##     302      14       1    5025   17178   10520     777       1   15146       5 
##  VERBAL 
##    1031

We decide to maintain as original the highest four and merge the rest into a new category.

fire_data$alarm_source_desc_new <- fire_data$al_source_desc
levels(fire_data$alarm_source_desc_new) <- list(
  "PHONE" = "PHONE",
  "EMS" = "EMS",
  "EMS-911" = "EMS-911",
  "CLASS-3" = "CLASS-3",
  "Others" = c("911", "911TEXT", "VERBAL", "BARS", "ERS", "ERS-NC", "SOL")
)

ctable(fire_data$al_source_desc, fire_data$alarm_source_desc_new, prop = 'n', totals = FALSE, headings = FALSE)
## 
## ---------------- ----------------------- ------- ------- --------- --------- --------
##                    alarm_source_desc_new   PHONE     EMS   EMS-911   CLASS-3   Others
##   al_source_desc                                                                     
##              911                               0       0         0         0      302
##          911TEXT                               0       0         0         0       14
##             BARS                               0       0         0         0        1
##          CLASS-3                               0       0         0      5025        0
##              EMS                               0   17178         0         0        0
##          EMS-911                               0       0     10520         0        0
##              ERS                               0       0         0         0      777
##           ERS-NC                               0       0         0         0        1
##            PHONE                           15146       0         0         0        0
##              SOL                               0       0         0         0        5
##           VERBAL                               0       0         0         0     1031
## ---------------- ----------------------- ------- ------- --------- --------- --------
fire_data$al_source_desc <- fire_data$alarm_source_desc_new
fire_data$alarm_source_desc_new <- NULL

View again the dataset summary to see the applied changes.

print(dfSummary(fire_data, 
                plain.ascii  = FALSE, 
                style        = "multiline", 
                headings     = FALSE,
                graph.magnif = 0.8, 
                valid.col    = FALSE),
                method = 'render')
No Variable Stats / Values Freqs (% of Valid) Graph Missing
1 id [character]
1. 230905-B0042-001-1051
2. 230905-B0053-001-0760
3. 230905-B0053-002-0910
4. 230905-B0081-001-1137
5. 230905-B0106-002-0632
6. 230905-B0132-001-0713
7. 230905-B0147-001-0967
8. 230905-B0160-001-1125
9. 230905-B0163-001-1026
10. 230905-B0165-001-0778
[ 49990 others ]
1(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
49990(100.0%)
0 (0.0%)
2 al_borough [factor]
1. Bronx
2. Brooklyn
3. Manhattan
4. Queens
5. Staten Island
10973(21.9%)
13980(28.0%)
12890(25.8%)
9879(19.8%)
2278(4.6%)
0 (0.0%)
3 al_number [integer]
Mean (sd) : 2930.3 (2446.5)
min ≤ med ≤ max:
10 ≤ 2275 ≤ 9933
IQR (CV) : 2772 (0.8)
7411 distinct values 0 (0.0%)
4 al_location [character]
1. 8 AVE & W 155 ST
2. 10 RICHMAN PLZ/SEDGWICK A
3. AMSTERDAM AVE & LA SALLE
4. 3 AVE & E 143 ST
5. WASHINGTON AVE & E 170 ST
6. FDR DR & E 6 ST
7. CONCOURSE VILLAGE E & E 1
8. PARK AVE & E 158 ST
9. UNION TPK & WINCHESTER BL
10. 8 AVE & W 33 ST
[ 12203 others ]
85(0.2%)
75(0.1%)
50(0.1%)
48(0.1%)
48(0.1%)
45(0.1%)
44(0.1%)
40(0.1%)
40(0.1%)
39(0.1%)
49486(99.0%)
0 (0.0%)
5 inc_borough [factor]
1. Bronx
2. Brooklyn
3. Manhattan
4. Queens
5. Staten Island
10973(21.9%)
13980(28.0%)
12890(25.8%)
9879(19.8%)
2278(4.6%)
0 (0.0%)
6 zipcode [integer]
Mean (sd) : 10737.9 (551.8)
min ≤ med ≤ max:
10000 ≤ 10472 ≤ 11697
IQR (CV) : 1098 (0.1)
217 distinct values 3181 (6.4%)
7 pol_prec [integer]
Mean (sd) : 62.3 (34.8)
min ≤ med ≤ max:
1 ≤ 61 ≤ 123
IQR (CV) : 56 (0.6)
77 distinct values 3180 (6.4%)
8 city_con_dist [integer]
Mean (sd) : 23.1 (15.1)
min ≤ med ≤ max:
1 ≤ 21 ≤ 51
IQR (CV) : 27 (0.7)
51 distinct values 3180 (6.4%)
9 commu_dist [integer]
Mean (sd) : 262.9 (119.4)
min ≤ med ≤ max:
101 ≤ 302 ≤ 595
IQR (CV) : 206 (0.5)
70 distinct values 3180 (6.4%)
10 commu_sc_dist [integer]
Mean (sd) : 14.8 (9.7)
min ≤ med ≤ max:
1 ≤ 13 ≤ 32
IQR (CV) : 18 (0.7)
32 distinct values 3182 (6.4%)
11 cong_dist [factor]
1. 3
2. 5
3. 6
4. 7
5. 8
6. 9
7. 10
8. 11
9. 12
10. 13
[ 3 others ]
634(1.4%)
3613(7.7%)
2309(4.9%)
4022(8.6%)
4539(9.7%)
3333(7.1%)
4768(10.2%)
2878(6.1%)
5089(10.9%)
5590(11.9%)
10045(21.5%)
3180 (6.4%)
12 al_source_desc [factor]
1. PHONE
2. EMS
3. EMS-911
4. CLASS-3
5. Others
15146(30.3%)
17178(34.4%)
10520(21.0%)
5025(10.1%)
2131(4.3%)
0 (0.0%)
13 al_index_desc [factor]
1. DEFAULT RECORD
2. Initial Alarm
3. Others
17313(34.6%)
32562(65.1%)
125(0.2%)
0 (0.0%)
14 highest_al_level [factor]
1. First Alarm
2. NonFirst Alarm
49891(99.8%)
109(0.2%)
0 (0.0%)
15 inc_class [factor]
1. Abandoned Derelict Vehicl
2. Alarm System - Defective
3. Alarm System - Testing
4. Alarm System - Unnecessar
5. Assist Civilian - Non-Med
6. Automobile Fire
7. Brush Fire
8. Carbon Monoxide - Code 1
9. Carbon Monoxide - Code 2
10. Carbon Monoxide - Code 3
[ 57 others ]
7(0.0%)
387(0.8%)
728(1.5%)
2845(5.7%)
4140(8.3%)
106(0.2%)
27(0.1%)
813(1.6%)
133(0.3%)
92(0.2%)
40722(81.4%)
0 (0.0%)
16 inc_class_group [factor]
1. Medical Emergencies
2. Medical MFAs
3. NonMedical Emergencies
4. NonMedical MFAs
5. NonStructural Fires
6. Structural Fires
26824(53.6%)
208(0.4%)
19072(38.1%)
1680(3.4%)
703(1.4%)
1513(3.0%)
0 (0.0%)
17 disp_resp_min_qy [numeric]
Mean (sd) : 0.7 (2.2)
min ≤ med ≤ max:
0 ≤ 0.3 ≤ 150.4
IQR (CV) : 0.6 (3.3)
850 distinct values 0 (0.0%)
18 first_ass_datetime [POSIXct, POSIXt]
min : 2023-09-05 14:19:12
med : 2023-09-18 08:10:18.5
max : 2023-10-01 00:05:02
range : 25d 9H 45M 50S
49509 distinct values 0 (0.0%)
19 first_act_datetime [POSIXct, POSIXt]
min : 2023-09-05 14:19:26
med : 2023-09-18 08:09:01
max : 2023-10-01 00:05:16
range : 25d 9H 45M 50S
49205 distinct values 139 (0.3%)
20 first_onscene_datetime [POSIXct, POSIXt]
min : 2023-09-05 14:23:21
med : 2023-09-18 11:08:09.5
max : 2023-10-01 00:09:41
range : 25d 9H 46M 20S
35552 distinct values 14112 (28.2%)
21 inc_close_datetime [POSIXct, POSIXt]
min : 2023-09-05 14:25:05
med : 2023-09-18 08:35:24
max : 2023-10-01 00:58:42
range : 25d 10H 33M 37S
49409 distinct values 0 (0.0%)
22 disp_resp_min_indc [factor]
1. N
2. Y
50000(100.0%)
0(0.0%)
0 (0.0%)
23 inc_resp_min_indc [factor]
1. N
2. Y
17036(34.1%)
32964(65.9%)
0 (0.0%)
24 inc_resp_min_qy [numeric]
Mean (sd) : 6.4 (3.9)
min ≤ med ≤ max:
0.3 ≤ 5.6 ≤ 118.8
IQR (CV) : 2.7 (0.6)
1497 distinct values 14112 (28.2%)
25 inc_travel_min_qy [numeric]
Mean (sd) : 5.7 (3.5)
min ≤ med ≤ max:
0 ≤ 5 ≤ 118.7
IQR (CV) : 2.6 (0.6)
1369 distinct values 14112 (28.2%)
26 engines_assigned [integer]
Mean (sd) : 1.1 (0.8)
min ≤ med ≤ max:
0 ≤ 1 ≤ 19
IQR (CV) : 0 (0.7)
15 distinct values 62 (0.1%)
27 ladders_assigned [integer]
Mean (sd) : 0.6 (0.8)
min ≤ med ≤ max:
0 ≤ 0 ≤ 15
IQR (CV) : 1 (1.4)
12 distinct values 62 (0.1%)
28 others_units_assigned [integer]
Mean (sd) : 0.3 (0.8)
min ≤ med ≤ max:
0 ≤ 0 ≤ 32
IQR (CV) : 0 (2.8)
23 distinct values 62 (0.1%)
29 emergency_min_qy [numeric]
Mean (sd) : 17.8 (27.7)
min ≤ med ≤ max:
0 ≤ 12 ≤ 2615
IQR (CV) : 13.6 (1.6)
4429 distinct values 14112 (28.2%)
30 day_type [factor]
1. Weekday
2. Weekend
36482(73.0%)
13518(27.0%)
0 (0.0%)
31 working_hour [factor]
1. FALSE
2. TRUE
21859(43.7%)
28141(56.3%)
0 (0.0%)
32 time_of_day [factor]
1. Night
2. Morning
3. Afternoon
4. Evening
8521(17.0%)
13270(26.5%)
16499(33.0%)
11710(23.4%)
0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.3.2)
2024-01-17

3.4 Dealing with Invalid Values

The next step is to deal invalid values and delete some un-useful predictors.

3.4.1 Duplicate Columns

First of all we saw the possibility that al_borough and inc_borough represent the same column, let’s chek it.

identical(fire_data$al_borough, fire_data$inc_borough)
## [1] TRUE

The column `al_borough and inc_borough have the same sequence of values, so we can delete one of the two.

fire_data <- fire_data %>% select(-c(al_borough))

3.4.2 Constant value for all observations

Then we say that all observation in the dataset have the disp_resp_min_indc equal to N, let’s check again and in affermative case then we can delete both columns.

summary(fire_data$disp_resp_min_indc)
##     N     Y 
## 50000     0

All our observations have non valid disp_resp_min_indc so we could delete both the column indicator and the respective column quantity disp_resp_min_qy. However we note that also in the original dataset all the observation have the disp_resp_min_indc set to N, which is quite strange, and seems that is problem relative to the data acquisition, thus for the moment we decide to keep this time difference.

3.4.3 Validity Column Check

Now we do a quick check also on the other indicator variable inc_resp_min_indc

summary(fire_data$inc_resp_min_indc)
##     N     Y 
## 17036 32964

But here we have some observations with valid inc_resp_min_indc, and we will consider only the valid one deleting the one that has a non valid attribute.

However before doing that let’s see the distribution of inc_resp_min_qy around the borough.

ggplot(data=fire_data %>% group_by(inc_borough, inc_resp_min_indc) %>% summarise(incident_number = n()), 
       aes(x=inc_borough, y=incident_number, fill=inc_resp_min_indc)) +
  geom_bar(stat="identity", position=position_dodge()) +
  geom_text(aes(label=incident_number), vjust=1.6, color="white",
            position = position_dodge(0.9), size=3.5) +
  scale_fill_brewer(palette="Paired") +
  labs(title = "Incident Count - Borouh - Valid Response Time in Minutes", x = "Borough", y = "Incident Number", fill = "Valid Response\n Time in Minutes")
## `summarise()` has grouped output by 'inc_borough'. You can override using the
## `.groups` argument.

We can see that the number of fire incident is higher for the valid response time in minutes but, it is much interesting observe the rateo between the valid and the non valid.

And to the rateo of valid inc_resp_min_indc in each borough is:

rateo_inc_resp_min_indc <- fire_data %>% 
  group_by(inc_borough, inc_resp_min_indc) %>% 
  summarise(incident_number = n()) %>% 
  mutate(ratio=incident_number/sum(incident_number))
## `summarise()` has grouped output by 'inc_borough'. You can override using the
## `.groups` argument.
ggplot(rateo_inc_resp_min_indc, aes(fill=inc_resp_min_indc, y=ratio, x=inc_borough)) + 
  geom_bar(position="fill", stat="identity") + 
  geom_text(aes(label=scales::percent(ratio)), position=position_fill(vjust=0.5)) +
  labs(title="Borough - Rateo Incident between Valid and Invalid",
       x="Borough",
       y="Rateo Incident between Valid and Invalid",
       fill="Valid Response\nTime in Minutes")

And we can see that Staten Island has the higher number of incidents with valid inc_resp_min_indc , whereas Manhattan has the lower number, but remember that the former has the lowest number of fire incident and the latter has the higher number of incident.

Now we do an additional analysis to see if there is some find of relation between the inc_resp_min_indc and total_assigned_unit which is the sum of all the assigned units.

fire_data$total_assigned_unit <- fire_data$engines_assigned + fire_data$ladders_assigned + fire_data$others_units_assigned


ggplot(fire_data %>% drop_na(), aes(total_assigned_unit, inc_resp_min_qy)) + 
  geom_point(aes(colour = inc_resp_min_indc), na.rm = TRUE)+
  #scale_color_gradient(na.value = NA) + 
  labs(title = "Total Assigned Units - Response Time In Minutes", x = "Total Assigned Units",
        y = "Response Time In Minutes", colour = "Valid Response\n Time in Minutes")

We note that the majority of fire incident that had been assigned a single units has a high response time and the relative measure is not valid. Whereas for an higher number of total units the response time decrease and becomes valids.

ggplot(fire_data %>% filter(inc_resp_min_indc == "N") %>% drop_na()
            , aes(total_assigned_unit, inc_resp_min_qy)) + 
  geom_point(aes(colour = inc_class_group)) +
  labs(title = "Total Assigned Units - Response Time In Minutes - Incidnet Class Group", x = "Total Assigned Units", y = "Response Time In Minutes", colour = "Incident Class Groups")

A closer look to the distribution of invalid response time for each incident class group summarised by the count of incident.

irt_i_class <- arrange(fire_data %>% filter(inc_resp_min_indc == "N") %>%
  group_by(inc_class_group) %>% summarise(incident_number = n()), desc(incident_number))
irt_i_class
## # A tibble: 6 Ă— 2
##   inc_class_group        incident_number
##   <fct>                            <int>
## 1 Medical Emergencies              15116
## 2 NonMedical Emergencies            1589
## 3 NonMedical MFAs                    162
## 4 Structural Fires                    66
## 5 NonStructural Fires                 59
## 6 Medical MFAs                        44

Now we add a indicator column that tell us if the total assigned units is one or not. This would be useful in order to see possible relations.

# add an additional predictor
fire_data$tua_is_one <- as.factor(ifelse(fire_data$total_assigned_unit == 1, TRUE, FALSE))

See the relative incident count for each incident class.

irt_i_class_su <- arrange(fire_data %>% filter(inc_resp_min_indc == "N",  tua_is_one == TRUE) %>%
         group_by(inc_class_group) %>% summarise(incident_number = n()), desc(incident_number))
irt_i_class_su
## # A tibble: 5 Ă— 2
##   inc_class_group        incident_number
##   <fct>                            <int>
## 1 Medical Emergencies              14950
## 2 NonMedical Emergencies             836
## 3 NonMedical MFAs                    128
## 4 Medical MFAs                        28
## 5 NonStructural Fires                 19

And we found that majority of all observation if not almost the entire dataset records are from the Medical Emergencies with only irt_i_class[1,2] - irt_i_class_su[1,2] incident that have been assigned more than a single unit. Whereas almost all the other incidents are from the NonMedical Emergencies which in this case the spread of incident that have been assigned more than a single unit is higher, so we have to take into account this higher amount of observations. Indeed we have irt_i_class[2,2] - irt_i_class_su[2,2] NonMedical Emergencies incidents that have invalid response time and have been assigned more than a single units.

At this point we decide to view the presence or not of pattern of incident with invalid response time that belong to the Medical Emergencies class with a single units assigned, first by grouping by borough.

# filter by inc_resp_min_indc == "N" and inc_class_group == "Medical Emergencies
tuaisone <- fire_data %>% 
        filter(inc_resp_min_indc == "N", inc_class_group == "Medical Emergencies") %>%
        group_by(inc_borough, tua_is_one) %>%
        summarise(incident_number = n())
## `summarise()` has grouped output by 'inc_borough'. You can override using the
## `.groups` argument.

Plot the bat chart.

ggplot(data=tuaisone, 
       aes(x=inc_borough, y=incident_number, fill=tua_is_one)) +
  geom_bar(stat="identity", position=position_dodge()) +
  geom_text(aes(label=incident_number), vjust=1.5, color="black",
            position = position_dodge(0.9), size=3.5) +
  scale_fill_brewer(palette="Set1") +
  labs(title = "Total Assigned Units One or Not for Invaid Resp Min Time",
       x = "Borough", y = "Incident Count", fill = "Total Assigned\nUnits are One")

And we can see that Bronx, Brooklyn and Manhattan have more or less the same amount of incident with invalid response time in minutes, which have been assigned a single units.

Now we continue this analysis fo the invalid response time by changing the focus on the type of incident class (the sub-category which is more precise) that had been assigned a single total units.

ggplot(data=fire_data %>% 
        filter(inc_resp_min_indc == "N", inc_class_group == "Medical Emergencies", tua_is_one == TRUE) %>%
        group_by(inc_class, inc_borough) %>%
        summarise(incident_number = n()), 
       aes(x=inc_borough, y=incident_number, fill=inc_class)) + 
      geom_bar(stat="identity", position=position_dodge()) +
        geom_text(aes(label=incident_number), vjust=-0.5, color="black",
                  position = position_dodge(0.9), size=3) +
        scale_fill_brewer(palette="Set1") +
        labs(title = "Borough - Incident Counts - Incident Class -- for Total Assigned Units equal to 1 for Invalid Responde Time",
             x = "Borough", y = "Incident Counts", fill = "Incident Class Group")
## `summarise()` has grouped output by 'inc_class'. You can override using the
## `.groups` argument.

And we found that the majority of the incident that respect these circumstances are mostly identified as Medical - EMS Link 10-91 and Medical - PD Link 10-91.

Thanks to the 10code site we found a description of the two emergency codes:

  1. 10-91 Medical Emergency EMS - Fire Unit Not Required - To be transmitted through borough dispatcher by the responding unit when the fire Unit is canceled enroute due to EMS on scene, or EMS downgrades the job to a segment that does not require a Fire Unit response. Note: This signal shall be used only for medical emergency incidents. EMS we are sure that stands for Emergency Medical Services.

  2. 10-91 Medical Emergency PD - Fire Unit Not Required - To be transmitted through borough dispatcher by the responding unit when the fire Unit is canceled enroute due to PD on scene, or PD downgrades the job to a segment that does not require a Fire Unit response. Note: This signal shall be used only for medical emergency incidents. PD we think that stands for Police Department.

Thus we can trust this information since they make sense and consider only the observations that have inc_resp_min_indc == "Y".

Now we can look for the NonMedical Emergencies since are the second category for number of observation that by first see the distribution of its incident class.

Here we print the top 5 class which respect the following conditions: inc_resp_min_indc == "N", inc_class_group == "NonMedical Emergencies"

print(head(arrange(fire_data %>% 
        filter(inc_resp_min_indc == "N", inc_class_group == "NonMedical Emergencies") %>%
        group_by(inc_class) %>%
        summarise(incident_number = n()), desc(incident_number))))
## # A tibble: 6 Ă— 2
##   inc_class                     incident_number
##   <fct>                                   <int>
## 1 Assist Civilian - Non-Medical             828
## 2 Alarm System - Unnecessary                110
## 3 Elevator Emergency - Occupied             104
## 4 Vehicle Accident - Other                  100
## 5 Maritime Emergency                         62
## 6 Utility Emergency - Water                  60

And we found that the majority of non valid inc_resp_min_indc that are Non-Medical Emergency are from the incident class equal to Assist Civilian - Non-Medical.

ggplot(data=fire_data %>% 
          filter(inc_resp_min_indc == "N", inc_class == "Assist Civilian - Non-Medical") %>%
          group_by(inc_borough) %>%
          summarise(incident_number = n()), 
        aes(x=inc_borough, y=incident_number)) + 
      geom_bar(stat="identity", position=position_dodge()) +
      geom_text(aes(label=incident_number), vjust=1.6, color="white", position = position_dodge(0.9), size=3.5) +
      labs(title = "Assist Civilian - Non-Medical / Invalid Response / More than 1 Units", x = "Borough", y = "Incident Count")

We can see that except Staten Island the amount of incident is pretty constant.

So for a stake of sempicity we decided to trust the indicator of incident response time and thus considering only the observations that have inc_resp_min_indc == "Y".

fire_data <- fire_data %>% filter(inc_resp_min_indc == "Y")
dim(fire_data)
## [1] 32964    33

3.4.4 Does the Incident Classes are unique for each Incident Class Group?

Now we want to know how many inc_class are summarized in each inc_class_group and if there are any intersection between any incident classes. We perform this analysis to be sure that each inc_class_group is referred to a single inc_class.

unique_category <- fire_data%>% 
  group_by(inc_class) %>%
  summarise(unique_maincategory = n_distinct((inc_class_group)))

conflict <- unique_category %>% filter(unique_maincategory > 1)

if (dim(conflict)[1] == 0){
  print("There are NO conflict between main and sub-category")
} else {
  print("There are conflict between main and sub-category")
}
## [1] "There are NO conflict between main and sub-category"

So we are shure that each sub category has a unique conncetion with a single main-category.

At this point to be more clear we display each main class with each respective sub-class.

for (variable in levels(fire_data$inc_class_group)) {
  non_zero_table <- table(subset(fire_data, inc_class_group == variable)$inc_class)
  cat(variable, "\n")
  print(non_zero_table[non_zero_table != 0])
  cat("\n")
}
## Medical Emergencies 
## 
##              Medical - Assist Civilian      Medical - Breathing / Ill or Sick 
##                                     27                                   4779 
##               Medical - EMS Link 10-91 Medical - No PT Contact EMS is Onscene 
##                                   1096                                   4285 
##                Medical - PD Link 10-91     Medical - Serious Life Threatening 
##                                    868                                    366 
##              Medical - Victim Deceased 
##                                    287 
## 
## Medical MFAs 
## 
## Medical MFA - EMS Link  Medical MFA - PD Link 
##                     87                     77 
## 
## NonMedical Emergencies 
## 
##                          Alarm System - Defective 
##                                               377 
##                            Alarm System - Testing 
##                                               706 
##                        Alarm System - Unnecessary 
##                                              2735 
##                     Assist Civilian - Non-Medical 
##                                              3312 
##          Carbon Monoxide - Code 1 - Investigation 
##                                               788 
##     Carbon Monoxide - Code 2 - Incident (1-9 ppm) 
##                                               129 
## Carbon Monoxide - Code 3 - Emergency (over 9 ppm) 
##                                                88 
## Carbon Monoxide - Code 4 - No Detector Activation 
##                                                 8 
##                              Defective Oil Burner 
##                                                34 
##                                       Downed Tree 
##                                               280 
##                     Elevator Emergency - Occupied 
##                                              1850 
##                   Elevator Emergency - Unoccupied 
##                                               708 
##             Non-Medical 10-91 (Unnecessary Alarm) 
##                                               102 
##                                Odor - Other Smoke 
##                                               166 
##                           Odor - Other Than Smoke 
##                                              1317 
##                        Remove Civilian - Non-Fire 
##                                                27 
##                      Sprinkler System - Activated 
##                                                 6 
##                    Sprinkler System - Malfunction 
##                                                41 
##              Sprinkler System - Working on System 
##                                                28 
##                          Transit System Emergency 
##                                                18 
##                               Undefined Emergency 
##                                                71 
##                      Utility Emergency - Electric 
##                                               595 
##                           Utility Emergency - Gas 
##                                              1335 
##                         Utility Emergency - Steam 
##                                               137 
##                     Utility Emergency - Undefined 
##                                                 4 
##                         Utility Emergency - Water 
##                                              1157 
##                          Vehicle Accident - Other 
##                                              1443 
##               Vehicle Accident - With Extrication 
##                                                21 
## 
## NonMedical MFAs 
## 
##                Non-Medical MFA - ERS     Non-Medical MFA - ERS No Contact 
##                                  586                                    1 
##              Non-Medical MFA - Phone Non-Medical MFA - Private Fire Alarm 
##                                  701                                  223 
##             Non-Medical MFA - Verbal 
##                                    7 
## 
## NonStructural Fires 
## 
##   Abandoned Derelict Vehicle Fire                   Automobile Fire 
##                                 6                               101 
##                        Brush Fire Demolition Debris or Rubbish Fire 
##                                24                               272 
##        Manhole Fire - Blown Cover              Manhole Fire - Other 
##                                 9                                55 
##      Manhole Fire - Seeping Smoke         Other Transportation Fire 
##                               104                                14 
##    Transit System - NonStructural 
##                                59 
## 
## Structural Fires 
## 
##                                    Church Fire 
##                                             10 
##                                   Factory Fire 
##                                              1 
##                                  Hospital Fire 
##                                             18 
##         Multiple Dwelling 'A' - Compactor fire 
##                                              4 
## Multiple Dwelling 'A' - Food on the stove fire 
##                                            519 
##             Multiple Dwelling 'A' - Other fire 
##                                            168 
##                     Multiple Dwelling 'B' Fire 
##                                             85 
##                 Other Commercial Building Fire 
##                                            184 
##                     Other Public Building Fire 
##                                              4 
##                          Private Dwelling Fire 
##                                            412 
##                                    School Fire 
##                                             31 
##                                     Store Fire 
##                                              9 
##                    Transit System - Structural 
##                                              1 
##                Under Contruction / Vacant Fire 
##                                              1

4 Dealing With Missing Data

At this point is essential to deal with NA values, trying to find the presence of possible relation with predictors. First things first let’s recap the number of NA values for each columns that we have at the moment.

colSums(is.na(fire_data))
##                     id              al_number            al_location 
##                      0                      0                      0 
##            inc_borough                zipcode               pol_prec 
##                      0                   2197                   2197 
##          city_con_dist             commu_dist          commu_sc_dist 
##                   2197                   2197                   2198 
##              cong_dist         al_source_desc          al_index_desc 
##                   2197                      0                      0 
##       highest_al_level              inc_class        inc_class_group 
##                      0                      0                      0 
##       disp_resp_min_qy     first_ass_datetime     first_act_datetime 
##                      0                      0                     41 
## first_onscene_datetime     inc_close_datetime     disp_resp_min_indc 
##                      0                      0                      0 
##      inc_resp_min_indc        inc_resp_min_qy      inc_travel_min_qy 
##                      0                      0                      0 
##       engines_assigned       ladders_assigned  others_units_assigned 
##                      4                      4                      4 
##       emergency_min_qy               day_type           working_hour 
##                      0                      0                      0 
##            time_of_day    total_assigned_unit             tua_is_one 
##                      0                      4                      4

4.1 Geographical Columns

Here we will check if there is a pattern on the absence of values in the following predictors: zipcode, pol_prec, city_con_dist, commu_dist, commu_sc_dist and cong_dist.

na_locations <- fire_data %>%
  filter(is.na(zipcode) | is.na(pol_prec) | is.na(city_con_dist) | is.na(commu_dist) | is.na(commu_sc_dist) | is.na(cong_dist))
ggplot(data=na_locations %>% 
        group_by(inc_class_group, inc_borough) %>%
        summarise(incident_number = n()), 
       aes(x=inc_borough, y=incident_number, fill=inc_class_group)) + geom_bar(stat="identity", position=position_dodge()) +
        geom_text(aes(label=incident_number), vjust=-0.5, color="black",
                  position = position_dodge(0.9), size=3.5) +
        scale_fill_brewer(palette="Set1") +
        labs(title = "NA Locations Columns", x = "Borough", y = "Incident Count", fill = "Incident Class Group")
## `summarise()` has grouped output by 'inc_class_group'. You can override using
## the `.groups` argument.

By the Bar Chart we note that the majority of observations that have at least one of the location predictors to NA are of the incident class group NonMedical Emergency, Medical Emergencies and Non Medical MFAs.

Here we verify if the proportion of NA is equally distributed around each borough.

table(na_locations$inc_borough) / table(fire_data$inc_borough)
## 
##         Bronx      Brooklyn     Manhattan        Queens Staten Island 
##    0.07104538    0.04694547    0.07662157    0.07700328    0.07523697

And we found that Brooklyn has a lower percentage of NA location observation respect the others that are constant between them.

Thus we compute the same process but this time looking into the incident class group, like before we would like to have a constant distribution among all classes.

table(na_locations$inc_class_group) / table(fire_data$inc_class_group)
## 
##    Medical Emergencies           Medical MFAs NonMedical Emergencies 
##             0.03988726             0.10975610             0.05691243 
##        NonMedical MFAs    NonStructural Fires       Structural Fires 
##             0.40447958             0.14906832             0.00552868

However by looking at the proportion with the original dataset around the 40% of the whole observations of type NonMedical MFAs have at least one of the location columns to NA. Let’s investigate.

In the original dataset this is the distribution of incident class for the NonMedical MFA incident class group.

fd_nm_mfa_cl <- table(subset(fire_data, inc_class_group == "NonMedical MFAs")$inc_class)
fd_nm_mfa_bro <- table(subset(fire_data, inc_class_group == "NonMedical MFAs")$inc_borough)

fd_nm_mfa_cl <- fd_nm_mfa_cl[fd_nm_mfa_cl != 0]
fd_nm_mfa_cl
## 
##                Non-Medical MFA - ERS     Non-Medical MFA - ERS No Contact 
##                                  586                                    1 
##              Non-Medical MFA - Phone Non-Medical MFA - Private Fire Alarm 
##                                  701                                  223 
##             Non-Medical MFA - Verbal 
##                                    7

And this is the distribution of incident class for the NonMedical MFA incident class group, but this time w.r.t. the subset of incidents that have at least one locaton column to NA.

na_nm_mfa_cl <- table(subset(na_locations, inc_class_group == "NonMedical MFAs")$inc_class)
na_nm_mfa_bro <- table(subset(na_locations, inc_class_group == "NonMedical MFAs")$inc_borough)

na_nm_mfa_cl <- na_nm_mfa_cl[names(fd_nm_mfa_cl)]
na_nm_mfa_cl
## 
##                Non-Medical MFA - ERS     Non-Medical MFA - ERS No Contact 
##                                  573                                    1 
##              Non-Medical MFA - Phone Non-Medical MFA - Private Fire Alarm 
##                                   38                                    2 
##             Non-Medical MFA - Verbal 
##                                    0

And we can clearly see that the observations of category Non-Medical MFA - ERS are the main.

na_nm_mfa_cl / fd_nm_mfa_cl
## 
##                Non-Medical MFA - ERS     Non-Medical MFA - ERS No Contact 
##                           0.97781570                           1.00000000 
##              Non-Medical MFA - Phone Non-Medical MFA - Private Fire Alarm 
##                           0.05420827                           0.00896861 
##             Non-Medical MFA - Verbal 
##                           0.00000000

And more interestingly the 97% of all the Non-Medical MFA - ERS observations in the entire dataset have one of the location attribute equal to NA.

na_nm_mfa_bro / fd_nm_mfa_bro
## 
##         Bronx      Brooklyn     Manhattan        Queens Staten Island 
##     0.4676056     0.3075221     0.3894472     0.3733333     0.7954545

And from here we can see that about the 78% of the observations that are NonMedical - MFAs that have at least one district column attribute to NA are from the RICHMOND / STATEN ISLAND. Also BRONX has about half of the NonMedical - MFAs observations that have at least one district column to NA.

ggplot(data=na_locations %>%
         filter(inc_class == "Non-Medical MFA - ERS") %>%
         group_by(inc_borough) %>%
         summarise(incident_count = n()), 
      aes(x=inc_borough, y=incident_count)) + 
      geom_bar(stat="identity", position=position_dodge()) +
      geom_text(aes(label=incident_count), vjust=1.6, color="white", position = position_dodge(0.9), size=3.5) +
      labs(title = "Non-Medical MFA - ERS", x = "Borough", y = "Incident Count")

Finally we have that we have a lower number of Non-Medical MFA - ERS for Queens and Staten Istand.

So concluding NonMedical MFA stands for Non Medical - Medical First Aid, unfortunately we are not able to find the meaning of ERS, even if we think that could be possible be connected with something regarding the respiratory system like ERS: Emergency Respiratory System. And thus we think that these observations have NA location column by the same reason of the previously discussed 10-91 Medical Emergency. However these are all suppositions, maybe by contacting the NYC Firefighters Department by mail should make things much clear.

Again for stake of semplicity we have assumed that the observations that have at least one NA locations are randomly spreaded among the dataset.

4.2 Assigned Units Column

print(fire_data %>%
  filter(is.na(engines_assigned) | is.na(ladders_assigned) | is.na(others_units_assigned)))
##                      id al_number                    al_location inc_borough
## 1 230905-Q4545-001-0683      4545                 53 AVE & 69 ST      Queens
## 2 230914-Q1014-001-0693      1014 CENTRAL AVE & VIRGINIA ST/LCFD      Queens
## 3 230918-Q9643-002-0752      9643           JAMAICA AVE & 114 ST      Queens
## 4 230919-M0684-002-0995       684                1 AVE & E 30 ST   Manhattan
##   zipcode pol_prec city_con_dist commu_dist commu_sc_dist cong_dist
## 1   11378      104            30        405            24         6
## 2   11691      101            31        414            27         5
## 3   11418      102            29        409            27         5
## 4   10016       13             4        106             2        12
##   al_source_desc  al_index_desc highest_al_level
## 1            EMS  Initial Alarm      First Alarm
## 2        EMS-911 DEFAULT RECORD      First Alarm
## 3          PHONE  Initial Alarm      First Alarm
## 4         Others  Initial Alarm      First Alarm
##                                inc_class        inc_class_group
## 1 Medical - No PT Contact EMS is Onscene    Medical Emergencies
## 2                Medical - PD Link 10-91    Medical Emergencies
## 3          Assist Civilian - Non-Medical NonMedical Emergencies
## 4               Vehicle Accident - Other NonMedical Emergencies
##   disp_resp_min_qy  first_ass_datetime  first_act_datetime
## 1        0.2666667 2023-09-05 16:32:44 2023-09-05 16:33:34
## 2        0.1000000 2023-09-14 22:57:58 2023-09-14 22:58:16
## 3        0.5833333 2023-09-18 23:44:39 2023-09-18 23:45:11
## 4        0.7000000 2023-09-19 16:35:57 2023-09-19 16:36:07
##   first_onscene_datetime  inc_close_datetime disp_resp_min_indc
## 1    2023-09-05 16:37:22 2023-09-05 16:40:41                  N
## 2    2023-09-14 23:02:24 2023-09-14 23:05:08                  N
## 3    2023-09-18 23:50:08 2023-09-19 00:07:10                  N
## 4    2023-09-19 16:46:50 2023-09-19 16:56:33                  N
##   inc_resp_min_indc inc_resp_min_qy inc_travel_min_qy engines_assigned
## 1                 Y        4.900000          4.633333               NA
## 2                 Y        4.533333          4.433333               NA
## 3                 Y        6.066667          5.483333               NA
## 4                 Y       11.583333         10.883333               NA
##   ladders_assigned others_units_assigned emergency_min_qy day_type working_hour
## 1               NA                    NA         3.316667  Weekday         TRUE
## 2               NA                    NA         2.733333  Weekday        FALSE
## 3               NA                    NA        17.033333  Weekday        FALSE
## 4               NA                    NA         9.716667  Weekday         TRUE
##   time_of_day total_assigned_unit tua_is_one
## 1   Afternoon                  NA       <NA>
## 2     Evening                  NA       <NA>
## 3     Evening                  NA       <NA>
## 4   Afternoon                  NA       <NA>

We can easily remove this observations, since it not appear any pattern and are only 4.

4.3 First Act Datetime

Now we look into First Act Datetime and see if there is any pattern with the other columns.

na_first_act_datetime <- fire_data %>% filter(is.na(first_act_datetime))
print(arrange(na_first_act_datetime %>% group_by(inc_class, inc_borough) %>% summarise(incident_count = n()), desc(incident_count)))
## `summarise()` has grouped output by 'inc_class'. You can override using the
## `.groups` argument.
## # A tibble: 27 Ă— 3
## # Groups:   inc_class [15]
##    inc_class                              inc_borough incident_count
##    <fct>                                  <fct>                <int>
##  1 Assist Civilian - Non-Medical          Brooklyn                 4
##  2 Medical - Breathing / Ill or Sick      Brooklyn                 4
##  3 Medical - Breathing / Ill or Sick      Manhattan                4
##  4 Medical - No PT Contact EMS is Onscene Queens                   3
##  5 Non-Medical MFA - ERS                  Brooklyn                 3
##  6 Alarm System - Unnecessary             Brooklyn                 2
##  7 Assist Civilian - Non-Medical          Bronx                    1
##  8 Assist Civilian - Non-Medical          Queens                   1
##  9 Demolition Debris or Rubbish Fire      Brooklyn                 1
## 10 Downed Tree                            Manhattan                1
## # ℹ 17 more rows
ggplot(data=na_first_act_datetime %>% 
        group_by(inc_class_group, inc_borough) %>%
        summarise(incident_count = n()), 
    aes(x=inc_borough, y=incident_count, fill=inc_class_group)) +
    geom_bar(stat="identity", position=position_dodge()) +
    geom_text(aes(label=incident_count), vjust=1.6, color="white", position = position_dodge(0.9), size=3.5) +
    labs(title = "NA First Act Date", x = "Borough", y = "Incident Count", fill = "Incident Class Group")
## `summarise()` has grouped output by 'inc_class_group'. You can override using
## the `.groups` argument.

Seems to be random and thus there is no pattern that motivate the presence of NA values in first_act_datetime.

4.4 Performing na.omit

At this point we can omit the NA values.

fire_data_new <- na.omit(fire_data)

And the un-usefull predictors including disp_resp_min_indc and inc_travel_min_qy by since these two information are contained in inc_resp_min_qy.

fire_data_new <- fire_data_new %>% select(-c(zipcode, pol_prec, city_con_dist, commu_dist, al_location,
                                             commu_sc_dist, first_ass_datetime, first_act_datetime,
                                             first_onscene_datetime, inc_close_datetime, inc_resp_min_indc,
                                             disp_resp_min_indc, inc_travel_min_qy, disp_resp_min_qy, 
                                             id, al_number, inc_class))

We decide to remove also the incident class since it contains too much category values which are summarised in the incident class group.

print(dfSummary(fire_data_new, 
                plain.ascii  = FALSE, 
                style        = "multiline", 
                headings     = FALSE,
                graph.magnif = 0.8, 
                valid.col    = FALSE),
                method = 'render')
No Variable Stats / Values Freqs (% of Valid) Graph Missing
1 inc_borough [factor]
1. Bronx
2. Brooklyn
3. Manhattan
4. Queens
5. Staten Island
6406(20.8%)
9280(30.2%)
7294(23.7%)
6188(20.1%)
1560(5.1%)
0 (0.0%)
2 cong_dist [factor]
1. 3
2. 5
3. 6
4. 7
5. 8
6. 9
7. 10
8. 11
9. 12
10. 13
[ 3 others ]
458(1.5%)
2420(7.9%)
1532(5.0%)
2907(9.5%)
3135(10.2%)
2257(7.3%)
3212(10.5%)
2073(6.7%)
2909(9.5%)
3504(11.4%)
6321(20.6%)
0 (0.0%)
3 al_source_desc [factor]
1. PHONE
2. EMS
3. EMS-911
4. CLASS-3
5. Others
13465(43.8%)
7369(24.0%)
4349(14.2%)
4817(15.7%)
728(2.4%)
0 (0.0%)
4 al_index_desc [factor]
1. DEFAULT RECORD
2. Initial Alarm
3. Others
2979(9.7%)
27631(89.9%)
118(0.4%)
0 (0.0%)
5 highest_al_level [factor]
1. First Alarm
2. NonFirst Alarm
30626(99.7%)
102(0.3%)
0 (0.0%)
6 inc_class_group [factor]
1. Medical Emergencies
2. Medical MFAs
3. NonMedical Emergencies
4. NonMedical MFAs
5. NonStructural Fires
6. Structural Fires
11222(36.5%)
146(0.5%)
16471(53.6%)
904(2.9%)
547(1.8%)
1438(4.7%)
0 (0.0%)
7 inc_resp_min_qy [numeric]
Mean (sd) : 5.9 (2.7)
min ≤ med ≤ max:
0.3 ≤ 5.5 ≤ 58.9
IQR (CV) : 2.5 (0.5)
1147 distinct values 0 (0.0%)
8 engines_assigned [integer]
Mean (sd) : 1.1 (0.9)
min ≤ med ≤ max:
0 ≤ 1 ≤ 19
IQR (CV) : 0 (0.8)
15 distinct values 0 (0.0%)
9 ladders_assigned [integer]
Mean (sd) : 0.8 (0.8)
min ≤ med ≤ max:
0 ≤ 1 ≤ 15
IQR (CV) : 1 (1)
12 distinct values 0 (0.0%)
10 others_units_assigned [integer]
Mean (sd) : 0.4 (0.9)
min ≤ med ≤ max:
0 ≤ 0 ≤ 32
IQR (CV) : 1 (2.3)
22 distinct values 0 (0.0%)
11 emergency_min_qy [numeric]
Mean (sd) : 17.1 (19.3)
min ≤ med ≤ max:
0 ≤ 11.9 ≤ 596.4
IQR (CV) : 12.4 (1.1)
4040 distinct values 0 (0.0%)
12 day_type [factor]
1. Weekday
2. Weekend
22379(72.8%)
8349(27.2%)
0 (0.0%)
13 working_hour [factor]
1. FALSE
2. TRUE
12677(41.3%)
18051(58.7%)
0 (0.0%)
14 time_of_day [factor]
1. Night
2. Morning
3. Afternoon
4. Evening
4487(14.6%)
8370(27.2%)
10627(34.6%)
7244(23.6%)
0 (0.0%)
15 total_assigned_unit [integer]
Mean (sd) : 2.3 (2.2)
min ≤ med ≤ max:
1 ≤ 1 ≤ 66
IQR (CV) : 2 (1)
34 distinct values 0 (0.0%)
16 tua_is_one [factor]
1. FALSE
2. TRUE
12940(42.1%)
17788(57.9%)
0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.3.2)
2024-01-17

5 Data Visaulisation

In this section we will have a look on additional data visualisation in order to better understand how the predictors behaves. We divide this section in multiple subsection each focus on a relative subset of columns

See the final columns that we have:

names(fire_data_new)
##  [1] "inc_borough"           "cong_dist"             "al_source_desc"       
##  [4] "al_index_desc"         "highest_al_level"      "inc_class_group"      
##  [7] "inc_resp_min_qy"       "engines_assigned"      "ladders_assigned"     
## [10] "others_units_assigned" "emergency_min_qy"      "day_type"             
## [13] "working_hour"          "time_of_day"           "total_assigned_unit"  
## [16] "tua_is_one"

We divide the data visualisation in the following subsections:

  1. Location Information
  2. Alarm Information
  3. Assigned Units Information
  4. Time Information
  5. Day Information

In each section analysis we will use both response: inc_resp_min_qy adn emergency_time_qy.

5.1 Location Information

Here we study the inc_borough and cong_dist columns for both responses.

5.1.1 Incident Borough

#bp1 <- 
ggplot(fire_data_new,
       aes(x = inc_borough, y = inc_resp_min_qy, color = inc_class_group)) +
  geom_boxplot() +
  labs(title = "Borough - Incident Minutes Response Time - Incident Class",
       x = "Borough", y = "Incident Minutes Response Time", color = "Incident Class")

Here we can see that the response time is pretty the same around all borough for each incident class, indeed there is a slightly fast response respect to the other class for the Structural Fires as expected. The the highest number of outliers are from the Medical and NonMedical Emergencies.

ggplot(fire_data_new,
       aes(x = inc_borough, y = emergency_min_qy, color = inc_class_group)) +
  geom_boxplot() +
  labs(title = "Borough - Emergency Minutes Time - Incident Class",
       x = "Borough", y = "Emergency Minutes Time", color = "Incident Class")

Like before the Structural Fires require lots of time but only in some drastic cases identified by the outliers since on average they require even less time respect other categories.

5.1.2 Congressional District

bp1 <- ggplot(fire_data_new,
       aes(x = cong_dist, y = inc_resp_min_qy)) +
  geom_boxplot() +
  labs(title = "Cong District - EmergencyMin Time",
       x = "Cong District", y = "Emergency Min Time")

bp2 <- ggplot(fire_data_new,
       aes(x = cong_dist, y = emergency_min_qy)) +
  geom_boxplot() +
  labs(title = "Cong District - Emergency Min Time",
       x = "Cong District", y = "Emergency Minutes Time")

grid.arrange(bp1, bp2, ncol = 2)

Here we can’t see any particular relevent information since on average we have the same distribution of time for all congressional district.

5.2 Alarm Information

Here we study the al_source_desc, al_index_desc and highest_al_level columns for both responses.

5.2.1 Alarm Source Description

ggplot(fire_data_new,
       aes(x = al_source_desc, y = inc_resp_min_qy, color = inc_class_group)) +
  geom_boxplot() +
  labs(title = "Alarm Source Description - Incident Minutes Response Time - Incident Class",
       x = "Alarm Source Description", y = "Incident Minutes Response Time", color = "Incident Class")

With this boxplot we can understand that there are many outliers for emergencies called via phone, in particular NonMedical Emergencies. Also in both EMS and EMS-911 there are many outliers for the Medical Emergencies. As expected then for emergencies called via EMS-911 we do not find both fires eergencies and non medical mfas.

ggplot(fire_data_new,
       aes(x = al_source_desc, y = emergency_min_qy, color = inc_class_group)) +
  geom_boxplot() +
  labs(title = "Alarm Source Description - Emergency Minutes Time - Incident Class",
       x = "Alarm Source Description", y = "Emergency Minutes Time", color = "Incident Class")

Now the first thing that we see is the high amount of outliers for the emergencies called via phone that are structural fires, indeed on average the emergency time is higher for that alarm source category. An other interesting fact is for CLASS-3 the NonMedical MFAs have a big inter-quantile range, sice we have few information for that particular class.

5.2.2 Alarm Index Description

ggplot(fire_data_new,
       aes(x = al_index_desc, y = inc_resp_min_qy, color = inc_class_group)) +
  geom_boxplot() +
  labs(title = "Alarm Index Description - Incident Minutes Response Time - Incident Class",
       x = "Alarm Index Description", y = "Incident Minutes Response Time", color = "Incident Class")

Again we can see many outliers for the two major class, and for the Initial alarm we do not have any Medical and NonMedical MFAs emergencies. For the Outlier class like in the previous cans the inter-quantile range for the NonStructural Fires is big since in the merged category we do no have many observations.

ggplot(fire_data_new,
       aes(x = al_index_desc, y = emergency_min_qy, color = inc_class_group)) +
  geom_boxplot() +
  labs(title = "Alarm Index Description - Emergency Minutes Time - Incident Class",
       x = "Alarm Index Description", y = "Emergency Minutes Time", color = "Incident Class")

Interestingly here we can see that around all the emergency with index level DEFAULT RECORD have very thin inter-quantile range with also low outliers, respect to the Initial Alarm and Others category level.

5.2.3 Highest Alarm Level

ggplot(fire_data_new,
       aes(x = highest_al_level, y = inc_resp_min_qy, color = inc_class_group)) +
  geom_boxplot() +
  labs(title = "Highest Alarm Level - Incident Minutes Response Time - Incident Class",
       x = "Highest Alarm Level", y = "Incident Minutes Response Time", color = "Incident Class")

Pretty all emergency of First Alarm have the same amount of response time with particular outliers for Medical and Non Medical Emergencies and also for NonMedical MFAs, whereas for NonFirst Alarm since we have few observation there for the non Structural Fire we have a big range of interquantile.

ggplot(fire_data_new,
       aes(x = highest_al_level, y = emergency_min_qy, color = inc_class_group)) +
  geom_boxplot() +
  labs(title = "Highest Alarm Level - Emergency Minutes Time - Incident Class",
       x = "Highest Alarm Level", y = "Emergency Minutes Time", color = "Incident Class")

Similarly to the previous subsection here all the emergency with highest alarm level to First Alarm have a low emergency time in particular Medical and NonMedical MFAs, again we have a significative number of outliers for this class. Regarding NonFirst Level we have few observations and that is why the emergency time is so different respect to the other class. Even if the longest time taken emergencies are a structural fire with highest alarm level NonFirst Alarm.

5.3 Assigned Units Information

Here we study the engines_assigned, ladders_assigned, others_units_assigned, total_assigned_unit and tua_is_one columns for both responses.

5.3.1 Eginees Assigned

#fire_data_new$engines_assigned <- as.numeric(fire_data_new$engines_assigned)
ggplot(fire_data_new,
       aes(x = engines_assigned, y = inc_resp_min_qy, color = inc_class_group)) +
  #geom_boxplot()
  geom_point(aes(color = inc_class_group)) +
  labs(title = "Engines Assigned - Incident Minutes Response Time - Incident Class",
       x = "Engines Assigned", y = "Incident Minutes Response Time", color = "Incident Class")

ggplot(fire_data_new,
       aes(x = engines_assigned, y = emergency_min_qy, color = inc_class_group)) +
  geom_point(aes(color = inc_class_group)) +
  labs(title = "Engines Assigned - Emergency Minutes Time - Incident Class",
       x = "Engines Assigned", y = "Emergency Minutes Time", color = "Incident Class")

5.3.2 Ladders Assigned

ggplot(fire_data_new,
       aes(x = ladders_assigned, y = inc_resp_min_qy, color = inc_class_group)) +
  geom_point(aes(color = inc_class_group)) +
  labs(title = "Ladders Assigned - Incident Minutes Response Time - Incident Class",
       x = "Ladders Assigned", y = "Incident Minutes Response Time", color = "Incident Class")

ggplot(fire_data_new,
       aes(x = ladders_assigned, y = emergency_min_qy, color = inc_class_group)) +
  geom_point(aes(color = inc_class_group))  +
  labs(title = "Ladders Assigned - Emergency Minutes Time - Incident Class",
       x = "Ladders Assigned", y = "Emergency Minutes Time", color = "Incident Class")

5.3.3 Others Units Assigned

ggplot(fire_data_new,
       aes(x = others_units_assigned, y = inc_resp_min_qy, color = inc_class_group)) +
  geom_point(aes(color = inc_class_group)) +
  labs(title = "Others Units Assigned - Incident Minutes Response Time - Incident Class",
       x = "Others Units Assigned", y = "Incident Minutes Response Time", color = "Incident Class")

ggplot(fire_data_new,
       aes(x = others_units_assigned, y = emergency_min_qy, color = inc_class_group)) +
  geom_point(aes(color = inc_class_group)) +
  labs(title = "Others Units Assigned - Emergency Minutes Time - Incident Class",
       x = "Others Units Assigned", y = "Emergency Minutes Time", color = "Incident Class")

5.3.4 Total Units Assigned

ggplot(fire_data_new,
       aes(x = total_assigned_unit, y = inc_resp_min_qy, color = inc_class_group)) +
  geom_point(aes(color = inc_class_group)) +
  labs(title = "Total Units Assigned - Incident Minutes Response Time - Incident Class",
       x = "Total Units Assigned", y = "Incident Minutes Response Time", color = "Incident Class")

Whereas here as described in a previous similar chart, the total assigned units and the incident response time in minutes appear to be inversely proportional, with the contradistinction of structural fires that have many assigned units and low response time and Non Medical Emergencies that have low assigned units and high response time.

ggplot(fire_data_new,
       aes(x = total_assigned_unit, y = emergency_min_qy, color = inc_class_group)) +
  geom_point(aes(color = inc_class_group)) +
  labs(title = "Total Units Assigned - Emergency Minutes Time - Incident Class",
       x = "Total Units Assigned", y = "Emergency Minutes Time", color = "Incident Class")

Here we can see that the Structural Fire incident have lots of variance directly and seems be a directly proportional relationship between the total assigned units and the Emergency Time. For the NonMedical Emergencies they are clustered and then for the Medical Emergencies we can see that they have been mostly assigend a single units.

5.3.5 Total Assigned Units is One

ggplot(fire_data_new,
       aes(x = tua_is_one, y = inc_resp_min_qy, color = inc_class_group)) +
  geom_boxplot() +
  labs(title = "Total Assigned Unit is 1 - Incident Minutes Response Time - Incident Class",
       x = "Total Assigned Unit is 1", y = "Incident Minutes Response Time", color = "Incident Class")

ggplot(fire_data_new,
       aes(x = tua_is_one, y = emergency_min_qy, color = inc_class_group)) +
  geom_boxplot() +
  labs(title = "Total Assigned Unit is 1 - Emergency Minutes Time - Incident Class",
       x = "Total Assigned Unit is 1", y = "Emergency Minutes Time", color = "Incident Class")

5.4 Time Information

Here we study the relation of inc_resp_min_qy against emergency_time_qy.

ggplot(fire_data_new,
       aes(x = inc_resp_min_qy, y = emergency_min_qy, color = inc_class_group)) +
  geom_point(aes(color = inc_class_group)) +
  labs(title = "Incident Minutes Response Time - Emergency Minutes Time - Incident Class",
       x = "Incident Minutes Response Time", y = "Emergency Minutes Time", color = "Incident Class")

Here we can see many different distribution with the relation that the less the response time is the higher is the time taken to close the emergency.

Since we can’t understand much by this chart we decide to plot the density of each incident class group for both response.

ggplot(fire_data_new, aes(x = inc_resp_min_qy, fill = inc_class_group)) +
  geom_density(alpha = 0.3) + xlim(0, 20) +
  labs(title = " Density Incident Minutes Response Time - Incident Class",
       x = "Incident Minutes Response Time", y = "Density", fill = "Incident Class")
## Warning: Removed 129 rows containing non-finite values (`stat_density()`).

In this density chart we can see that the distribution of Incident Response Time for each incident class seems to behave like a Log-Norm distribution. We limit the x axis in order to be able to understand the distribution otherwise it will result in a very thin curve so not much easy to understand.

ggplot(fire_data_new, aes(x = emergency_min_qy, fill = inc_class_group)) +
  geom_density(alpha = 0.5) + xlim(0, 75) +
  scale_fill_brewer(palette="Set1") +
  labs(title = "Density Emergency Response Time - Incident Class",
       x = "Emergency Response Time", y = "Density", fill = "Incident Class")
## Warning: Removed 482 rows containing non-finite values (`stat_density()`).

Similarly as before also here the distribution of Incident Time for each incident class seems to behave like a Log-Norm distribution. Again lke before we limit the x axis in order to be able to understand the distribution otherwise it will result in a very thin curve so not much easy to understand.

5.5 Day Information

Here we study the day_type, working_hour and time_of_day columns for both responses.

5.5.1 Day Type

ggplot(fire_data_new,
       aes(x = day_type, y = inc_resp_min_qy, color = inc_class_group)) +
  geom_boxplot() +
  labs(title = "Day Type - Incident Minutes Response Time - Incident Class",
       x = "Day Type", y = "Incident Minutes Response Time", color = "Incident Class")

Speaking about the day we can’t see any relevant pattern here except that it appears in the weekend we have a lower number of outliers for all the incident class.

ggplot(fire_data_new,
       aes(x = day_type, y = emergency_min_qy, color = inc_class_group)) +
  geom_boxplot() +
  labs(title = "Day Type - Emergency Minutes Time - Incident Class",
       x = "Day Type", y = "Emergency Minutes Time", color = "Incident Class")

Here there is any kind of patters that we can recondunct to the day type for the emergency time taken.

5.5.2 Working Hour

ggplot(fire_data_new,
       aes(x = working_hour, y = inc_resp_min_qy, color = inc_class_group)) +
  geom_boxplot() +
  labs(title = "Working Hour - Incident Minutes Response Time - Incident Class",
       x = "Working Hour", y = "Incident Minutes Response Time", color = "Incident Class")

Even for the working hourwe can’t see any pattern regarding the incident response time.

ggplot(fire_data_new,
       aes(x = working_hour, y = emergency_min_qy, color = inc_class_group)) +
  geom_boxplot() +
  labs(title = "Working Hour - Emergency Minutes Time - Incident Class",
       x = "Working Hour", y = "Incident Minutes Time", color = "Incident Class")

And even for the emergency time taken.

5.5.3 Day Time

ggplot(fire_data_new,
       aes(x = time_of_day, y = inc_resp_min_qy, color = inc_class_group)) +
  geom_boxplot() +
  labs(title = "Day Time - Incident Minutes Response Time - Incident Class",
       x = "Day Time", y = "Incident Minutes Response Time", color = "Incident Class")

Here we can’t see any relevant pattern except the fact that we have an higher number of outliers in the afternoonfor the Medical, NonMedical Emergencies and Structural Fires.

ggplot(fire_data_new,
       aes(x = time_of_day, y = emergency_min_qy, color = inc_class_group)) +
  geom_boxplot() +
  labs(title = "Day Time - Emergency Minutes Time - Incident Class",
       x = "Day Time", y = "Incident Minutes Time", color = "Incident Class")

Here interestingly we can see that the longest time taken emergencies are Structural Fires and they happend in the Nighe and on the Morning. Except this we do not have any pattern.

5.5.4 Maps Visualization

In this section we plot additional data visualization focus on the geographical visualization of the New York borough and congressional distirct with relative predictors. In order to do so we load an additional datasets:

  1. The fdny-firehouse-listing.csv is a dataset that includes the geographical informations of every firefighter stations in the NYC, including again latitude and longitude.
  2. NYC_BoroughBoundaries.geojson is the spatial dataframe that includes the geometry of each boroughs
  3. Congressiona_Districts.geojson is the spatial dataframe that includes the geometry of each congressional districts
firefighter_stations <- read.csv("datasets/fdny-firehouse-listing.csv")

head(firefighter_stations)
##                                              FacilityName       FacilityAddress
## 1                                      Engine 4/Ladder 15       42 South Street
## 2                                     Engine 10/Ladder 10    124 Liberty Street
## 3                                                Engine 6     49 Beekman Street
## 4 Engine 7/Ladder 1/Battalion 1/Manhattan Borough Command  100-104 Duane Street
## 5                                                Ladder 8 14 North Moore Street
## 6                                       Engine 9/Ladder 6       75 Canal Street
##     Borough Postcode Latitude Longitude Community.Board Community.Council
## 1 Manhattan    10005 40.70347 -74.00754               1                 1
## 2 Manhattan    10006 40.71007 -74.01252               1                 1
## 3 Manhattan    10038 40.71005 -74.00525               1                 1
## 4 Manhattan    10007 40.71546 -74.00594               1                 1
## 5 Manhattan    10013 40.71976 -74.00668               1                 1
## 6 Manhattan    10002 40.71521 -73.99290               3                 1
##   Census.Tract     BIN        BBL
## 1            7 1000867 1000350001
## 2           13 1075700 1000520022
## 3         1501 1001287 1000930030
## 4           33 1001647 1001500025
## 5           33 1002150 1001890035
## 6           16 1003898 1003000030
##                                                                           NTA
## 1 Battery Park City-Lower Manhattan                                          
## 2 Battery Park City-Lower Manhattan                                          
## 3 Battery Park City-Lower Manhattan                                          
## 4 SoHo-TriBeCa-Civic Center-Little Italy                                     
## 5 SoHo-TriBeCa-Civic Center-Little Italy                                     
## 6 Chinatown
summary(firefighter_stations)
##  FacilityName       FacilityAddress      Borough             Postcode    
##  Length:218         Length:218         Length:218         Min.   :10001  
##  Class :character   Class :character   Class :character   1st Qu.:10304  
##  Mode  :character   Mode  :character   Mode  :character   Median :11103  
##                                                           Mean   :10784  
##                                                           3rd Qu.:11231  
##                                                           Max.   :11695  
##                                                           NA's   :5      
##     Latitude       Longitude      Community.Board  Community.Council
##  Min.   :40.51   Min.   :-74.24   Min.   : 1.000   Min.   : 1.00    
##  1st Qu.:40.66   1st Qu.:-73.99   1st Qu.: 3.000   1st Qu.:12.00    
##  Median :40.72   Median :-73.94   Median : 6.000   Median :27.00    
##  Mean   :40.72   Mean   :-73.94   Mean   : 7.075   Mean   :25.63    
##  3rd Qu.:40.77   3rd Qu.:-73.89   3rd Qu.:11.000   3rd Qu.:38.00    
##  Max.   :40.89   Max.   :-73.72   Max.   :84.000   Max.   :51.00    
##  NA's   :5       NA's   :5        NA's   :5        NA's   :5        
##   Census.Tract         BIN               BBL                NTA           
##  Min.   :     1   Min.   :1000867   Min.   :1.000e+09   Length:218        
##  1st Qu.:   129   1st Qu.:2003268   1st Qu.:2.025e+09   Class :character  
##  Median :   275   Median :3064786   Median :3.025e+09   Mode  :character  
##  Mean   :  5950   Mean   :2900421   Mean   :2.850e+09                     
##  3rd Qu.:   800   3rd Qu.:4090228   3rd Qu.:4.033e+09                     
##  Max.   :157902   Max.   :5154879   Max.   :5.080e+09                     
##  NA's   :5        NA's   :5         NA's   :5

We now start with the firefighter stations dataset. By first making a copy of the fire_data_new and setting the borough from the firefighter_stations dataset to factor in order to be easily merged with the copied fire_data dataset.

# make a copy of the fire_data
fire_data_for_ffs <- fire_data_new

fire_data_for_ffs <- fire_data_for_ffs %>% rename(borough = inc_borough)

firefighter_stations$Borough <- as.factor(firefighter_stations$Borough)
firefighter_stations <- firefighter_stations %>% rename(borough = Borough)

# remove the na values from firefighter_stations
firefighter_stations <- na.omit(firefighter_stations)

Now we want to get the number of firefighter station for each borough.

stations_borough <- firefighter_stations %>%
                    group_by(borough) %>%
                    summarise(number_of_stations = n())

Now we want to get a summary of the incident count, the number of station and the incident per station of each borough in order to have a general view of the New York City situation.

This step is done twice we have to group both for borough and then for congressional district.

count_inc_brough <- fire_data_for_ffs %>% group_by(borough) %>% summarise(incident_count = n())
count_inc_cdist <- fire_data_for_ffs %>% group_by(cong_dist) %>% summarise(incident_count = n())

stations_borough$incident_per_station <- round(count_inc_brough$incident_count / stations_borough$number_of_stations, digits = 3)

count_inc_brough <- merge(count_inc_brough, stations_borough, by="borough")

count_inc_brough
##         borough incident_count number_of_stations incident_per_station
## 1         Bronx           6406                 34              188.412
## 2      Brooklyn           9280                 64              145.000
## 3     Manhattan           7294                 47              155.191
## 4        Queens           6188                 48              128.917
## 5 Staten Island           1560                 20               78.000

Now we convert the firefighter_station data frame into a Spartial Data Frame to contains the geometry points.

firefighter_stations_sdf <- st_as_sf(firefighter_stations, coords = c("Longitude", "Latitude"), crs = 4326)
head(firefighter_stations_sdf)
## Simple feature collection with 6 features and 10 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -74.01252 ymin: 40.70347 xmax: -73.9929 ymax: 40.71976
## Geodetic CRS:  WGS 84
##                                              FacilityName       FacilityAddress
## 1                                      Engine 4/Ladder 15       42 South Street
## 2                                     Engine 10/Ladder 10    124 Liberty Street
## 3                                                Engine 6     49 Beekman Street
## 4 Engine 7/Ladder 1/Battalion 1/Manhattan Borough Command  100-104 Duane Street
## 5                                                Ladder 8 14 North Moore Street
## 6                                       Engine 9/Ladder 6       75 Canal Street
##     borough Postcode Community.Board Community.Council Census.Tract     BIN
## 1 Manhattan    10005               1                 1            7 1000867
## 2 Manhattan    10006               1                 1           13 1075700
## 3 Manhattan    10038               1                 1         1501 1001287
## 4 Manhattan    10007               1                 1           33 1001647
## 5 Manhattan    10013               1                 1           33 1002150
## 6 Manhattan    10002               3                 1           16 1003898
##          BBL
## 1 1000350001
## 2 1000520022
## 3 1000930030
## 4 1001500025
## 5 1001890035
## 6 1003000030
##                                                                           NTA
## 1 Battery Park City-Lower Manhattan                                          
## 2 Battery Park City-Lower Manhattan                                          
## 3 Battery Park City-Lower Manhattan                                          
## 4 SoHo-TriBeCa-Civic Center-Little Italy                                     
## 5 SoHo-TriBeCa-Civic Center-Little Italy                                     
## 6 Chinatown                                                                  
##                     geometry
## 1 POINT (-74.00754 40.70347)
## 2 POINT (-74.01252 40.71007)
## 3 POINT (-74.00525 40.71005)
## 4 POINT (-74.00594 40.71546)
## 5 POINT (-74.00668 40.71976)
## 6  POINT (-73.9929 40.71521)

5.5.4.1 Downloand of the geojson files

At this point we download the .geojson files that contain all the geometry of each borough in order to have a cool maps visualization of NYC.

Here we load the NYC_BoroughBoundaries.geojson.

geoj_nyc_borough <- geojson_read("datasets/NYC_BoroughBoundaries.geojson",  what = "sp")
geoj_nyc_borough <- setNames(geoj_nyc_borough, c("borough_code", "borough", "shape_area", "shape_leng"))
geoj_nyc_borough$borough <- as.factor(geoj_nyc_borough$borough)
geoj_nyc_borough$borough_code <- NULL
head(geoj_nyc_borough)
##         borough    shape_area    shape_leng
## 1 Staten Island 1623620725.05  325917.35395
## 2     Manhattan 636520502.758 357713.308162
## 3         Bronx  1187174772.5 463180.579449
## 4      Brooklyn 1934138215.76 728146.574928
## 5        Queens 3041418506.64 888199.731385

And here the Congressiona_Districts.geojson.

geoj_nyc_cdist <- geojson_read("datasets/Congressional_Districts.geojson",  what = "sp")
geoj_nyc_cdist$cong_dist <- as.factor(geoj_nyc_cdist$cong_dist)
head(geoj_nyc_cdist)
##   cong_dist    shape_area    shape_leng
## 1        16 53350456.3831  38220.195314
## 2        11 1810287145.51 411602.819728
## 3         5 1313441026.63 647233.193924
## 4         6 723356617.906 198557.173232
## 5         8 736292261.799 575320.090507
## 6         9 422693976.037 115321.546362

And now we merge geoj_nyc_borough with count_inc_brough maintaining the Spartial Data Frame type.

geoj_nyc_borough@data = data.frame(geoj_nyc_borough@data, count_inc_brough[match(geoj_nyc_borough@data$borough, count_inc_brough$borough),])
geoj_nyc_borough@data$borough.1 <- NULL # remove the added column

Here we are making the same step as before but this this with the geoj_nyc_cdist merging via cong_dist, again maintaining the Spartial Data Frame type.

geoj_nyc_cdist@data = data.frame(geoj_nyc_cdist@data, count_inc_cdist[match(geoj_nyc_cdist@data$cong_dist, count_inc_cdist$cong_dist),])
geoj_nyc_cdist@data$cong_dist.1 <- NULL # remove the added column

And finally we can plot the interactive maps using the mapview function.

First for Incident Borough:

mapview(list(firefighter_stations_sdf, geoj_nyc_borough),
        zcol = list(NULL, "incident_count"),
        legend = list(FALSE, TRUE),
        homebutton = list(FALSE, TRUE),
        layer.name = list(NULL, "Incidents Count"),
        popup = list(popupTable(firefighter_stations_sdf), popupTable(geoj_nyc_borough)),
        alpha.regions = 0.5, aplha = 1)

And then for Congressional District:

mapview(geoj_nyc_cdist,
        zcol = "incident_count",
        legend = TRUE,
        popup = popupTable(geoj_nyc_cdist),
        homebutton = TRUE,
        layer.name = "Incidents Count",
        alpha.regions = 0.5, aplha = 1)

6 Let’s build some models (or at least try)

As suggested by the professor we have opted to linear regression using as response:

1- inc_resp_min_qy 2- emergency_min_qy

Initially we were thinking to solve a multi-classification / binary classification problem for the inc_class_group, however we were considering all the time difference predictors that are a future information w.r.t. the inc_class_group in prediction time, so it doesn’t make much sense to use them, and it is possible also that they will result in super predictors. That’s way we decided to grab the professor suggestion.

For both analysis we transform the relative response in log scale in order to simulate the behaviour of the Log-Norm distribution. Of course we have to verify that the linearty assumption are meet.

So first things first let’s check if there are some observations that have at least one of the two possible responses equal to zero, if so we have to remove then in order to apply next the log transformation.

summary(fire_data_new %>% select(inc_resp_min_qy, emergency_min_qy))
##  inc_resp_min_qy  emergency_min_qy
##  Min.   : 0.350   Min.   :  0.00  
##  1st Qu.: 4.383   1st Qu.:  7.15  
##  Median : 5.467   Median : 11.93  
##  Mean   : 5.949   Mean   : 17.08  
##  3rd Qu.: 6.850   3rd Qu.: 19.58  
##  Max.   :58.917   Max.   :596.38

Remove them.

fire_data_new <- fire_data_new %>% filter(emergency_min_qy != 0)

Then we have to check the presence of correlation in the continuous predictors and if so deleting one or more of them. Thus here we are doing an initial chec of multicollinearity problems only in the numerical variable before once we create our for the future models. In order to do so compute the square of the correlation matrix.

round(cor(fire_data_new %>% dplyr::select(where(is.numeric)))^2, digits=3)
##                       inc_resp_min_qy engines_assigned ladders_assigned
## inc_resp_min_qy                 1.000            0.064            0.015
## engines_assigned                0.064            1.000            0.317
## ladders_assigned                0.015            0.317            1.000
## others_units_assigned           0.023            0.294            0.325
## emergency_min_qy                0.000            0.046            0.017
## total_assigned_unit             0.045            0.719            0.693
##                       others_units_assigned emergency_min_qy
## inc_resp_min_qy                       0.023            0.000
## engines_assigned                      0.294            0.046
## ladders_assigned                      0.325            0.017
## others_units_assigned                 1.000            0.121
## emergency_min_qy                      0.121            1.000
## total_assigned_unit                   0.704            0.077
##                       total_assigned_unit
## inc_resp_min_qy                     0.045
## engines_assigned                    0.719
## ladders_assigned                    0.693
## others_units_assigned               0.704
## emergency_min_qy                    0.077
## total_assigned_unit                 1.000

As we can see total_assigned_unit is heavily correlated to the other counts since it is the sum of those, thus deceide to remove the sum of units. Continuing we note also that lot’s of time difference are correlated to each other, whoever it is obvious since some of them include other smaller difference, these measures will be managed soon once we deal with the two type of analysis.

fire_data_new <- fire_data_new %>% select(-c(total_assigned_unit))

Next before creating any model have to split the cleaned dataset into train and test, with 80% of the whole dataset for the train set and the remaining 20% for the test set.

set.seed(43)
split <- sample.split(fire_data_new, SplitRatio = 0.8)

# Create training and testing sets
fire_data.train <- subset(fire_data_new, split == TRUE)
fire_data.test <- subset(fire_data_new, split == FALSE)

rownames(fire_data.train) <- NULL
rownames(fire_data.test) <- NULL

dim(fire_data.train)
## [1] 24580    15
dim(fire_data.test)
## [1] 6146   15

7 Linear Regression???

7.1 Use inc_resp_min_qy as response

In this section we use inc_resp_min_qy as response, thus we omit emergency_time_qy since it is a future difference of time that is not know at prediction time.

# make a copy of the train and test
resp_min_fd.train <- fire_data.train
resp_min_fd.test <- fire_data.test

# remove the future time differences
resp_min_fd.train <- resp_min_fd.train %>% select(-c(emergency_min_qy))
resp_min_fd.test <- resp_min_fd.test %>% select(-c(emergency_min_qy))

Let’s build our first Linear Regression Model

lm_irm_full <- lm(inc_resp_min_qy ~ ., data = resp_min_fd.train)
summary(lm_irm_full)
## 
## Call:
## lm(formula = inc_resp_min_qy ~ ., data = resp_min_fd.train)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -7.579 -1.362 -0.387  0.797 52.096 
## 
## Coefficients:
##                                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                            6.554708   0.223239  29.362  < 2e-16 ***
## inc_boroughBrooklyn                   -0.777498   0.111424  -6.978 3.07e-12 ***
## inc_boroughManhattan                  -0.317584   0.096424  -3.294 0.000990 ***
## inc_boroughQueens                      0.017795   0.103635   0.172 0.863665    
## inc_boroughStaten Island              -0.686276   0.180180  -3.809 0.000140 ***
## cong_dist5                            -0.526126   0.143579  -3.664 0.000248 ***
## cong_dist6                            -0.136570   0.150617  -0.907 0.364558    
## cong_dist7                            -0.297748   0.154653  -1.925 0.054207 .  
## cong_dist8                            -0.074574   0.170482  -0.437 0.661803    
## cong_dist9                            -0.064911   0.173211  -0.375 0.707848    
## cong_dist10                           -0.211573   0.169415  -1.249 0.211733    
## cong_dist11                            0.019877   0.203041   0.098 0.922014    
## cong_dist12                            0.769480   0.180132   4.272 1.95e-05 ***
## cong_dist13                            0.034462   0.172636   0.200 0.841776    
## cong_dist14                            0.363853   0.159679   2.279 0.022697 *  
## cong_dist15                            0.001747   0.173684   0.010 0.991972    
## cong_dist16                            0.189488   0.241924   0.783 0.433485    
## al_source_descEMS                     -0.183994   0.107132  -1.717 0.085909 .  
## al_source_descEMS-911                 -0.214432   0.108236  -1.981 0.047585 *  
## al_source_descCLASS-3                  0.009961   0.054614   0.182 0.855281    
## al_source_descOthers                  -1.038050   0.113715  -9.129  < 2e-16 ***
## al_index_descInitial Alarm            -0.209377   0.071230  -2.939 0.003291 ** 
## al_index_descOthers                    1.033326   0.775457   1.333 0.182696    
## highest_al_levelNonFirst Alarm        -0.273546   0.787056  -0.348 0.728177    
## inc_class_groupMedical MFAs           -0.298057   0.245141  -1.216 0.224051    
## inc_class_groupNonMedical Emergencies  0.463379   0.105871   4.377 1.21e-05 ***
## inc_class_groupNonMedical MFAs         0.272255   0.154816   1.759 0.078662 .  
## inc_class_groupNonStructural Fires    -0.206078   0.162329  -1.270 0.204273    
## inc_class_groupStructural Fires       -0.051989   0.135047  -0.385 0.700264    
## engines_assigned                      -0.480012   0.027444 -17.491  < 2e-16 ***
## ladders_assigned                       0.185240   0.043240   4.284 1.84e-05 ***
## others_units_assigned                 -0.082868   0.027276  -3.038 0.002383 ** 
## day_typeWeekend                       -0.183876   0.036559  -5.030 4.95e-07 ***
## working_hourTRUE                      -0.263551   0.098290  -2.681 0.007337 ** 
## time_of_dayMorning                     0.009227   0.101972   0.090 0.927906    
## time_of_dayAfternoon                  -0.132103   0.110688  -1.193 0.232695    
## time_of_dayEvening                    -0.701469   0.054108 -12.964  < 2e-16 ***
## tua_is_oneTRUE                         1.083569   0.059964  18.070  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.543 on 24542 degrees of freedom
## Multiple R-squared:  0.1283, Adjusted R-squared:  0.127 
## F-statistic: 97.63 on 37 and 24542 DF,  p-value: < 2.2e-16

By the summary we can see that the model \(R^2\) tell us that we are able explain around the \(12.83\%\) of the total variability of the data and also the Adjusted R-squared which penalise the complexity of our model tell us the same story.

Let’s interpret the complete model by first specifying what categorical variable are contained in the intercept:

  • inc_borough = Bronx
  • cong_dist = 3
  • al_source_desc = PHONE
  • al_index_desc = DEFAULT RECORD
  • highest_al_level = First Alarm
  • inc_class_group = Medical Emergencies
  • day_type = Weekday
  • working_hour = FALSE
  • time_of_day = Night
  • tua_is_one = FALSE

So the intercept explain the mean response time in minutes for an incident with the cited values of categorical predictors and the value of the three following numeric predictors:

  • engines_assiged = 0
  • ladders_assigned = 0
  • others_units_assigned = 0

Speaking a little bit about the p-value we can mention that:

  • inc_boroughBrooklyn, inc_boroughManhattan, inc_boroughStaten Island have all significant p-value, this indicate the difference on mean response time for an incident having the intercept predictor values and an incident with the same predictor except the borough being one of the respective three. All of three have a decreasing effect on the mean response time.
  • Regarding the cong_dist only cong_dist5 and cong_dist12 have a significant p-value, both with a decreasing effect on the mean response time of an incident with the intercept characteristics except the cong_dist being on of the specified two.
  • Interestingly al_source_descOthers indicate a big decrease on the mean response time respect the reference level.
  • engines_assigned and others_units_assigned have a decreasing effect on the mean response time with the latter being quite reasonable significant and the former being heavely significant. On the other hand ladders_assigned is significant but differently respect the other two units measure its effect is increase the mean response time of the reference level.
  • day_typeWeekend is significant with an decrease effect on the mean response time for the reference level. The same discussion can be made for working_hourTRUE.
  • time_of_dayEvening and tua_is_oneY are significant with the first having a decreasing and the second having a increasing effect on the reference level.

At this point we can analyse the regression diagnostic plots:

par(mfrow=c(2,2))
plot(lm_irm_full)

Now we have to see if the linearity assumption are met and thus if we can use a linear regression model for our analysis.

  1. Residuals vs Fitted plot: here we can see if our residuals have a linear pattern and this is confermed by the straight horizontal red line. Even if,we have an higher amount of spreaded observations on the top of the red line.
  2. Q-Q Residuals plot: in this plot called also quantile - quantile residual plot and tells us if the residuals are normally distributed or not. If they follows the 45 degrees dotted line we can say so otherwise as in our case we can’t say that are normally distributed, as we will see in much detail later.
  3. Scaled-Location / Spread-Location plot: tells us if the residuals are equally spread across the predictors. This is the assessments of Homoscedasticity or equal variance. And we would like to see a sort of horizontal line, in our case we are not particullary happy.
  4. Residuals VS Leverage plot: helps us to identify the influential points with the Cook’s distance, so points that have influence on the regression line. And if some point feed in the area delimited by the dotted lines those points will be assigned as influential. In our case we do not have any observations that satisfy what we have just saied

Moreover we better analyse the distribution of the residuals by using the qqPlot from the car package. This will produce a better plot of the distribution of the residuals with relative confidence interval ban in blue.

qqPlot(residuals(lm_irm_full))

## [1] 12582 22268

Much clearly the qqPlot tells that the data the residuals are not normally distributed indeed are heavily right skewed. Thus we can’t trust the p-values and the estimation of the coefficients.

Here instead we have a look of all plots of residuals vs predictors and again the plot of residuals vs fitted values that we already see.

residualPlots(lm_irm_full)

##                       Test stat Pr(>|Test stat|)    
## inc_borough                                         
## cong_dist                                           
## al_source_desc                                      
## al_index_desc                                       
## highest_al_level                                    
## inc_class_group                                     
## engines_assigned         7.3922        1.491e-13 ***
## ladders_assigned         6.4210        1.379e-10 ***
## others_units_assigned    5.8162        6.093e-09 ***
## day_type                                            
## working_hour                                        
## time_of_day                                         
## tua_is_one                                          
## Tukey test               7.1290        1.011e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Both types of test statistics (residuals vs numerical predictors and Tukey test) say that we can not reject the null hypothesis that there is a lack of relationship.

Let’s have a look of the possible power transformation of the response.

powerTransform(lm_irm_full)
## Estimated transformation parameter 
##        Y1 
## 0.0918338

The function powerTransform suggests to take the log-transformation of the response, we take the log transformation because the estimated value of lambda is close to zero.

But before updating the full model scaling the response by logarithm let’s check if we have multicollinearity. To do so we use the vif function by the car package.

vif(lm_irm_full)
##                             GVIF Df GVIF^(1/(2*Df))
## inc_borough           296.746107  4        2.037268
## cong_dist             306.824565 12        1.269464
## al_source_desc         13.194620  4        1.380542
## al_index_desc          14.751137  2        1.959776
## highest_al_level        8.021463  1        2.832219
## inc_class_group        19.943038  5        1.348898
## engines_assigned        2.611920  1        1.616144
## ladders_assigned        4.764514  1        2.182777
## others_units_assigned   2.371379  1        1.539928
## day_type                1.005439  1        1.002716
## working_hour            8.902130  1        2.983644
## time_of_day             8.994580  3        1.442105
## tua_is_one              3.335387  1        1.826304

We decide to use as rule of thumb \(GIF < 10\) no strong multicollinearity problem, otherwise we have to delete some predictors since in some sense their information is kept by others. In our case we decide to remove the following predictors:

  • cong_dist
  • inc_class_group
  • highest_al_level

Let’s update our model scaling the response to the logarithm scale and remove the previously listed predictors.

lm_irm_full_upd <- update(lm_irm_full, log(inc_resp_min_qy) ~ . - cong_dist - inc_class_group - highest_al_level)
vif(lm_irm_full_upd)
##                           GVIF Df GVIF^(1/(2*Df))
## inc_borough           1.036957  4        1.004547
## al_source_desc        3.850387  4        1.183554
## al_index_desc         1.654411  2        1.134125
## engines_assigned      2.340625  1        1.529910
## ladders_assigned      4.461953  1        2.112334
## others_units_assigned 2.344313  1        1.531115
## day_type              1.004243  1        1.002119
## working_hour          8.895229  1        2.982487
## time_of_day           8.958791  3        1.441147
## tua_is_one            3.230816  1        1.797447

No multicollinearity problem, now we can analyse the sumamry of the updated model.

summary(lm_irm_full_upd)
## 
## Call:
## lm(formula = log(inc_resp_min_qy) ~ inc_borough + al_source_desc + 
##     al_index_desc + engines_assigned + ladders_assigned + others_units_assigned + 
##     day_type + working_hour + time_of_day + tua_is_one, data = resp_min_fd.train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.47299 -0.20076 -0.00435  0.19997  2.31625 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 1.9124315  0.0150958 126.687  < 2e-16 ***
## inc_boroughBrooklyn        -0.1773411  0.0068022 -26.071  < 2e-16 ***
## inc_boroughManhattan       -0.0365833  0.0071839  -5.092 3.56e-07 ***
## inc_boroughQueens          -0.0485684  0.0074903  -6.484 9.09e-11 ***
## inc_boroughStaten Island   -0.1271892  0.0119259 -10.665  < 2e-16 ***
## al_source_descEMS          -0.0697329  0.0091169  -7.649 2.10e-14 ***
## al_source_descEMS-911      -0.0754873  0.0098411  -7.671 1.77e-14 ***
## al_source_descCLASS-3       0.0314929  0.0078552   4.009 6.11e-05 ***
## al_source_descOthers       -0.4329450  0.0165747 -26.121  < 2e-16 ***
## al_index_descInitial Alarm -0.0170164  0.0083054  -2.049 0.040490 *  
## al_index_descOthers        -0.0003746  0.0483631  -0.008 0.993819    
## engines_assigned           -0.0860849  0.0038124 -22.580  < 2e-16 ***
## ladders_assigned            0.0220199  0.0061406   3.586 0.000336 ***
## others_units_assigned      -0.0045723  0.0039798  -1.149 0.250620    
## day_typeWeekend            -0.0201366  0.0053617  -3.756 0.000173 ***
## working_hourTRUE           -0.0690787  0.0144184  -4.791 1.67e-06 ***
## time_of_dayMorning         -0.0091723  0.0149534  -0.613 0.539625    
## time_of_dayAfternoon       -0.0235714  0.0162331  -1.452 0.146497    
## time_of_dayEvening         -0.1320686  0.0079344 -16.645  < 2e-16 ***
## tua_is_oneTRUE              0.1495102  0.0086606  17.263  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3731 on 24560 degrees of freedom
## Multiple R-squared:  0.1539, Adjusted R-squared:  0.1532 
## F-statistic: 235.1 on 19 and 24560 DF,  p-value: < 2.2e-16

Now remember that we can’t compere the \(R^2\) with the previous model since in the last one the response is on a different scale.

Regarding the reference level here the discussion is the same as the previous one, except the fact that the reference level is on logarithm scale so the predictors have a increasing or decreasing effect on the mean logarithm response time.

Now we analyse the regression diagnostic plots for the log response model.

par(mfrow=c(2,2))
plot(lm_irm_full_upd)

Like the previous model we can say the residuals follow a linear pattern much better that the previous model. Again we do not have any influential point. But on the other hand the qqPlot is pretty much a mess indicating that the residuals are not normally distributed, by this qqPlot we can say that:

  1. The smallest observations are larger than you would expect from a normal distribution (i.e. the points are above the line on the QQ-plot). This means the lower tail of the data’s distribution has been reduced, relative to a normal distribution.
  2. The largest observations are less than you would expect from a normal distribution (i.e. the points are below the line on the QQ-plot). This means the upper tail of the data’s distribution has been reduced, relative to a normal distribution.

Let’s much clearly analyse the new qqPlot.

qqPlot(residuals(lm_irm_full_upd))

## [1] 12203 10837

So in other word this “S” shaped qqPlot with a linear portion in the middle suggests the data have more extreme values (or outliers) than the normal distribution in the tails. This indicate the following statements:

1- The coefficient estimates are unbiased and consistent. 2- The Standard Error could be wrong. 3- The p-value are usually much lower respect to the real p-value. This means that a p-value in the order of 0.0001 could not be really significant.

Now we can see the residuals vs predictors and residuals vs fitted values.

residualPlots(lm_irm_full_upd)

##                       Test stat Pr(>|Test stat|)    
## inc_borough                                         
## al_source_desc                                      
## al_index_desc                                       
## engines_assigned         6.7460        1.553e-11 ***
## ladders_assigned         5.9483        2.746e-09 ***
## others_units_assigned    5.9928        2.091e-09 ***
## day_type                                            
## working_hour                                        
## time_of_day                                         
## tua_is_one                                          
## Tukey test               6.3188        2.636e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Again both types of test statistics (residuals vs numerical predictors and Tukey test) say that we can not reject the null hypothesis that there is a lack of relationship.

Thus again the linear assumptions are not meet, mainly by the non normal distribution of the residuals.

At this point we have decided in any case to investigate this behaviour trying to fix the non normality of the residuals by modifying the scale of some predictors and adding interaction term between them.

So the edits we will perform are the following one:

  • Removing the non significant predictors
  • Adding the interaction terms
  • Scaling the number of assigned units by the logarithm scale after having increased by a single units
lm_irm_full_upd_2 <- update(lm_irm_full_upd, . ~ . - others_units_assigned - engines_assigned + log(engines_assigned + 1) - ladders_assigned + log(ladders_assigned + 1) - others_units_assigned + log(others_units_assigned + 1) + log(engines_assigned + 1))
summary(lm_irm_full_upd_2)
## 
## Call:
## lm(formula = log(inc_resp_min_qy) ~ inc_borough + al_source_desc + 
##     al_index_desc + day_type + working_hour + time_of_day + tua_is_one + 
##     log(engines_assigned + 1) + log(ladders_assigned + 1) + log(others_units_assigned + 
##     1), data = resp_min_fd.train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.51334 -0.20152 -0.00374  0.20014  2.31666 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     2.026469   0.017578 115.285  < 2e-16 ***
## inc_boroughBrooklyn            -0.177803   0.006790 -26.185  < 2e-16 ***
## inc_boroughManhattan           -0.036462   0.007168  -5.087 3.67e-07 ***
## inc_boroughQueens              -0.049448   0.007482  -6.609 3.95e-11 ***
## inc_boroughStaten Island       -0.129561   0.011906 -10.882  < 2e-16 ***
## al_source_descEMS              -0.065330   0.009697  -6.737 1.65e-11 ***
## al_source_descEMS-911          -0.071254   0.010349  -6.885 5.92e-12 ***
## al_source_descCLASS-3           0.028023   0.007847   3.571 0.000356 ***
## al_source_descOthers           -0.430226   0.016604 -25.911  < 2e-16 ***
## al_index_descInitial Alarm     -0.016612   0.008287  -2.005 0.045017 *  
## al_index_descOthers            -0.016943   0.041840  -0.405 0.685530    
## day_typeWeekend                -0.020052   0.005350  -3.748 0.000179 ***
## working_hourTRUE               -0.070061   0.014387  -4.870 1.12e-06 ***
## time_of_dayMorning             -0.007745   0.014921  -0.519 0.603709    
## time_of_dayAfternoon           -0.022317   0.016197  -1.378 0.168274    
## time_of_dayEvening             -0.131385   0.007917 -16.596  < 2e-16 ***
## tua_is_oneTRUE                  0.078191   0.010123   7.724 1.17e-14 ***
## log(engines_assigned + 1)      -0.193912   0.008353 -23.215  < 2e-16 ***
## log(ladders_assigned + 1)      -0.026963   0.011289  -2.388 0.016931 *  
## log(others_units_assigned + 1) -0.057145   0.009286  -6.154 7.67e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3723 on 24560 degrees of freedom
## Multiple R-squared:  0.1576, Adjusted R-squared:  0.1569 
## F-statistic: 241.8 on 19 and 24560 DF,  p-value: < 2.2e-16

Now we can compare this \(R^2 = 0.1576\) with the one of the previous model which is \(R^2 = 0.1539\). So with the newest model we have explained \(0.37\%\) much variability respect to the initial model. Pretty poor result I would say, so the effort of changing the scale of the assigned units doesn’t paied off. We have already tried to insert some kind of interaction terms, however the model \(R^2\) results in a lower value respect to the initial one thus we prefer the cleaned model after the multicollinearity check.

Anyway we contine thus analysis by interpreting the updated model. The intercept level is composed by the following predictors:

Let’s interpret the complete model by first specifying what categorical variable are contained in the intercept:

  • inc_borough = Broonx
  • al_source_desc = PHONE
  • al_index_desc = DEFAULT RECORD
  • day_type = Weekday
  • working_hour = FALSE
  • time_of_day = Night
  • tua_is_one = FALSE

So the intercept explain the mean logarithm response time for an incident with the cited values of categorical predictors and the value of the three following numeric predictors:

  • log(engines_assigned + 1) = 0
  • log(others_units_assigned + 1) = 0
  • log(ladders_assigned + 1) = 0

Since we add an interaction term, let’s discuss who it behaves in our model:

Next the scaled logarithm assigned units indicates the average decrease in terms of logarithm response time per logarithm assigned units + 1 for the intercept level.

Now we analyse the regression diagnostic plots.

par(mfrow=c(2,2))
plot(lm_irm_full_upd_2)

  1. Residuals vs Fitted plot: here we can see that the residuals have a linear pattern.
  2. Q-Q Residuals plot: again we do not have the situation of normal distribution of the residuals, as we will have a closer look soon.
  3. Scaled-Location / Spread-Location plot: homoscedasticity quite reached even if we have a light curvature
  4. Residuals VS Leverage plot: we can’t see any influential pont
qqPlot(residuals(lm_irm_full_upd_2))

## [1] 12203 10837

Unfortunately the situation is the same as the previous described one.

residualPlots(lm_irm_full_upd_2)

##                                Test stat Pr(>|Test stat|)    
## inc_borough                                                  
## al_source_desc                                               
## al_index_desc                                                
## day_type                                                     
## working_hour                                                 
## time_of_day                                                  
## tua_is_one                                                   
## log(engines_assigned + 1)         4.6314        3.650e-06 ***
## log(ladders_assigned + 1)        10.3532        < 2.2e-16 ***
## log(others_units_assigned + 1)    5.6596        1.534e-08 ***
## Tukey test                        3.9529        7.722e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
influenceIndexPlot(lm_irm_full_upd_2, vars = "Cook")

Again both types of test statistics (residuals vs numerical predictors and Tukey test) say that we can not reject the null hypothesis that there is a lack of relationship.

In conclusion unfortunately we can’t apply linear regression on the log scale of inc_resp_min_qy, the main problem is the distribution of the residuals with is not normal thus we can’t really trust the p-values end the estimation of the model coefficients.

Let’s see if in the other type response the assumption of linearity are meet or not (spoiler…they are not verified again :( ).

7.2 Use emergency_min_qy as response

Fit a linear regression model with all the predictors.

lm_em_full <- lm(emergency_min_qy ~ ., data = fire_data.train)
summary(lm_em_full)
## 
## Call:
## lm(formula = emergency_min_qy ~ ., data = fire_data.train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -117.66   -8.59   -3.09    3.95  429.84 
## 
## Coefficients:
##                                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                            0.76022    1.44352   0.527 0.598447    
## inc_boroughBrooklyn                   -5.72702    0.70887  -8.079 6.82e-16 ***
## inc_boroughManhattan                  -0.85206    0.61297  -1.390 0.164523    
## inc_boroughQueens                     -3.05730    0.65866  -4.642 3.47e-06 ***
## inc_boroughStaten Island              -4.41859    1.14549  -3.857 0.000115 ***
## cong_dist5                             0.75546    0.91278   0.828 0.407875    
## cong_dist6                             2.70597    0.95728   2.827 0.004706 ** 
## cong_dist7                             1.27991    0.98298   1.302 0.192905    
## cong_dist8                             2.31850    1.08352   2.140 0.032382 *  
## cong_dist9                             1.30084    1.10086   1.182 0.237354    
## cong_dist10                            1.54859    1.07677   1.438 0.150395    
## cong_dist11                            0.76077    1.29044   0.590 0.555506    
## cong_dist12                            2.39445    1.14527   2.091 0.036563 *  
## cong_dist13                            4.64611    1.09720   4.235 2.30e-05 ***
## cong_dist14                            2.67685    1.01496   2.637 0.008360 ** 
## cong_dist15                            3.21466    1.10386   2.912 0.003592 ** 
## cong_dist16                            2.22881    1.53759   1.450 0.147198    
## al_source_descEMS                      1.57887    0.68093   2.319 0.020419 *  
## al_source_descEMS-911                  2.64836    0.68796   3.850 0.000119 ***
## al_source_descCLASS-3                 -3.86922    0.34710 -11.147  < 2e-16 ***
## al_source_descOthers                   2.85085    0.72395   3.938 8.24e-05 ***
## al_index_descInitial Alarm            16.02968    0.45279  35.402  < 2e-16 ***
## al_index_descOthers                   64.18114    4.92866  13.022  < 2e-16 ***
## highest_al_levelNonFirst Alarm        60.26687    5.00222  12.048  < 2e-16 ***
## inc_class_groupMedical MFAs           -1.89320    1.55806  -1.215 0.224341    
## inc_class_groupNonMedical Emergencies -7.42976    0.67313 -11.038  < 2e-16 ***
## inc_class_groupNonMedical MFAs        -3.12835    0.98401  -3.179 0.001479 ** 
## inc_class_groupNonStructural Fires     2.57761    1.03173   2.498 0.012484 *  
## inc_class_groupStructural Fires       -6.30824    0.85831  -7.350 2.05e-13 ***
## inc_resp_min_qy                        0.12082    0.04057   2.978 0.002904 ** 
## engines_assigned                       1.25752    0.17550   7.165 7.99e-13 ***
## ladders_assigned                       1.42824    0.27492   5.195 2.06e-07 ***
## others_units_assigned                  4.20284    0.17339  24.240  < 2e-16 ***
## day_typeWeekend                       -0.24967    0.23247  -1.074 0.282837    
## working_hourTRUE                      -0.76343    0.62478  -1.222 0.221754    
## time_of_dayMorning                     2.10198    0.64809   3.243 0.001183 ** 
## time_of_dayAfternoon                   1.54255    0.70350   2.193 0.028341 *  
## time_of_dayEvening                    -0.73022    0.34506  -2.116 0.034338 *  
## tua_is_oneTRUE                         1.40265    0.38363   3.656 0.000256 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.16 on 24541 degrees of freedom
## Multiple R-squared:  0.3359, Adjusted R-squared:  0.3349 
## F-statistic: 326.7 on 38 and 24541 DF,  p-value: < 2.2e-16

The first thing that come to our eyes is the \(R^2\) value which in this case is \(0.3359\) thus we explain more or less 1/3 of the variability of the data. Also the Adjusted R-squared has a similar value which is \(0.3349\).

Now let’s interpret the complete model by first specifying what categorical variable are contained in the intercept:

  • inc_borough = Bronx
  • cong_dist = 3
  • al_source_desc = PHONE
  • al_index_desc = DEFAULT RECORD
  • highest_al_level = First Alarm
  • inc_class_group = Medical Emergencies
  • day_type = Weekday
  • working_hour = FALSE
  • time_of_day = Night
  • tua_is_one = FALSE

So the intercept explain the mean emergency time in minutes taken for an incident with the cited values of categorical predictors and the value of the three following numeric predictors:

  • engines_assiged = 0
  • ladders_assigned = 0
  • others_units_assigned = 0

Speaking a little bit about the p-value we can mention some coefficients:

  • inc_boroughBrooklyn, inc_boroughManhattan, inc_boroughStaten Island have all significant p-value, this indicate the difference on mean emergency time taken in minutes for an incident having the intercept predictor values and an incident with the same predictor except the borough being one of the respective three. All of three have a decreasing effect on the mean emergency time, respectively of \(-5.72702\) minutes, \(-3.05730\) minutes and \(-4.41859\) minutes.
  • Regarding the cong_dist only cong_dist13 has a statistically high significant p-value, it has a increase effect on the mean emergency time in minutes of an incident with the intercept characteristics except the cong_dist.
  • For al_source_desc we have all the factorial predictor highly statistically significant except al_source_descEMS which has a low level of p-value
  • Interestingly we can see that al_index_descInitial, al_index_descOthers have an high increasing decreasing effect on the mean emergency time taken of an incident with the intercept characteristics except the al_index_desc being on of the specified two. Respectively of \(16.02968\) and \(64.18114\) minutes.
  • highest_al_levelNonFirst has an heavy increasing effect (\(60.26687\) minutes) on the mean emergency time taken respect the reference level.

Now we analyse the regression diagnostic plots for the log response model.

par(mfrow=c(2,2))
plot(lm_em_full)

Here the fitted values vs the residual are behaving in a liner relation but they are not randomly spread since we can view three or even more clusters , the same thing discussion can be made for the Scale Location plot in which we can see that there is no constant variance. But again we are in a situation in which the residuals are not normally distributed as we can more clearly see in the following plot.

qqPlot(residuals(lm_em_full))

## [1] 18469 22272

Let’s have a look of the possible power transformation of the response.

powerTransform(lm_em_full)
## Estimated transformation parameter 
##        Y1 
## 0.1396974

We will try transform the response both using the logarithm scale and the power of 0.14. in order to see in we have an improvement on the distribution of the residuals.

But before updating the full model scaling the response by logarithm let’s check if we have multicollinearity. To do so we use the vif function by the car package.

vif(lm_em_full)
##                             GVIF Df GVIF^(1/(2*Df))
## inc_borough           297.733490  4        2.038114
## cong_dist             309.983817 12        1.270006
## al_source_desc         13.241310  4        1.381152
## al_index_desc          14.757897  2        1.960000
## highest_al_level        8.021502  1        2.832226
## inc_class_group        20.007948  5        1.349336
## inc_resp_min_qy         1.147185  1        1.071067
## engines_assigned        2.644479  1        1.626186
## ladders_assigned        4.768077  1        2.183593
## others_units_assigned   2.372271  1        1.540218
## day_type                1.006475  1        1.003232
## working_hour            8.904738  1        2.984081
## time_of_day             9.067792  3        1.444055
## tua_is_one              3.379765  1        1.838414

Of course we have a similar problem as the previous analysis. We decide to remove the same predictors so cong_dist, and highest_al_level.

Now we can update the initial model by using the suggested power transformation of 0.14 and remove the previously mentioned predictors.

lm_em_full_014 <- update(lm_em_full, I(emergency_min_qy ^ 0.14) ~ . - cong_dist - highest_al_level)
summary(lm_em_full_014)
## 
## Call:
## lm(formula = I(emergency_min_qy^0.14) ~ inc_borough + al_source_desc + 
##     al_index_desc + inc_class_group + inc_resp_min_qy + engines_assigned + 
##     ladders_assigned + others_units_assigned + day_type + working_hour + 
##     time_of_day + tua_is_one, data = fire_data.train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.77457 -0.09273 -0.00692  0.09106  0.89223 
## 
## Coefficients:
##                                         Estimate Std. Error t value Pr(>|t|)
## (Intercept)                            1.2256529  0.0094388 129.853  < 2e-16
## inc_boroughBrooklyn                   -0.0785052  0.0028285 -27.755  < 2e-16
## inc_boroughManhattan                  -0.0091956  0.0029611  -3.105  0.00190
## inc_boroughQueens                     -0.0529385  0.0030878 -17.144  < 2e-16
## inc_boroughStaten Island              -0.0883651  0.0049191 -17.964  < 2e-16
## al_source_descEMS                      0.0170292  0.0064694   2.632  0.00849
## al_source_descEMS-911                  0.0200506  0.0065351   3.068  0.00216
## al_source_descCLASS-3                 -0.0540694  0.0032835 -16.467  < 2e-16
## al_source_descOthers                   0.0209596  0.0068772   3.048  0.00231
## al_index_descInitial Alarm             0.2946077  0.0042983  68.541  < 2e-16
## al_index_descOthers                    0.6628939  0.0205279  32.292  < 2e-16
## inc_class_groupMedical MFAs            0.0142557  0.0148068   0.963  0.33567
## inc_class_groupNonMedical Emergencies -0.0740680  0.0063958 -11.581  < 2e-16
## inc_class_groupNonMedical MFAs        -0.0179087  0.0093479  -1.916  0.05540
## inc_class_groupNonStructural Fires    -0.0030321  0.0098016  -0.309  0.75706
## inc_class_groupStructural Fires       -0.0631966  0.0081515  -7.753 9.34e-15
## inc_resp_min_qy                       -0.0010962  0.0003836  -2.857  0.00428
## engines_assigned                       0.0047983  0.0016628   2.886  0.00391
## ladders_assigned                       0.0135935  0.0026071   5.214 1.86e-07
## others_units_assigned                  0.0146545  0.0016411   8.930  < 2e-16
## day_typeWeekend                       -0.0003086  0.0022088  -0.140  0.88890
## working_hourTRUE                      -0.0154905  0.0059365  -2.609  0.00908
## time_of_dayMorning                     0.0365238  0.0061570   5.932 3.03e-09
## time_of_dayAfternoon                   0.0309941  0.0066845   4.637 3.56e-06
## time_of_dayEvening                    -0.0010327  0.0032788  -0.315  0.75280
## tua_is_oneTRUE                        -0.0283558  0.0036414  -7.787 7.13e-15
##                                          
## (Intercept)                           ***
## inc_boroughBrooklyn                   ***
## inc_boroughManhattan                  ** 
## inc_boroughQueens                     ***
## inc_boroughStaten Island              ***
## al_source_descEMS                     ** 
## al_source_descEMS-911                 ** 
## al_source_descCLASS-3                 ***
## al_source_descOthers                  ** 
## al_index_descInitial Alarm            ***
## al_index_descOthers                   ***
## inc_class_groupMedical MFAs              
## inc_class_groupNonMedical Emergencies ***
## inc_class_groupNonMedical MFAs        .  
## inc_class_groupNonStructural Fires       
## inc_class_groupStructural Fires       ***
## inc_resp_min_qy                       ** 
## engines_assigned                      ** 
## ladders_assigned                      ***
## others_units_assigned                 ***
## day_typeWeekend                          
## working_hourTRUE                      ** 
## time_of_dayMorning                    ***
## time_of_dayAfternoon                  ***
## time_of_dayEvening                       
## tua_is_oneTRUE                        ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1536 on 24554 degrees of freedom
## Multiple R-squared:  0.2948, Adjusted R-squared:  0.2941 
## F-statistic: 410.6 on 25 and 24554 DF,  p-value: < 2.2e-16

Let’s see the residuals.

par(mfrow=c(2,2))
plot(lm_em_full_014)

qqPlot(residuals(lm_em_full_014))

## [1] 18658   733

In this case the two tails appear to be more homogeneous and better respect the previous model and even the previous analysis, however the final conclusion is the same so we do not have normal distribution of residuals as we can see by the qqPlot.

Trying the logarithm scale.

lm_em_full_log <- update(lm_em_full, log(emergency_min_qy) ~ . - cong_dist - highest_al_level)
summary(lm_em_full_log)
## 
## Call:
## lm(formula = log(emergency_min_qy) ~ inc_borough + al_source_desc + 
##     al_index_desc + inc_class_group + inc_resp_min_qy + engines_assigned + 
##     ladders_assigned + others_units_assigned + day_type + working_hour + 
##     time_of_day + tua_is_one, data = fire_data.train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.5058 -0.4388  0.0073  0.4900  4.1649 
## 
## Coefficients:
##                                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                            1.350539   0.049008  27.558  < 2e-16 ***
## inc_boroughBrooklyn                   -0.382856   0.014686 -26.069  < 2e-16 ***
## inc_boroughManhattan                  -0.045002   0.015375  -2.927  0.00342 ** 
## inc_boroughQueens                     -0.257843   0.016033 -16.082  < 2e-16 ***
## inc_boroughStaten Island              -0.439799   0.025541 -17.219  < 2e-16 ***
## al_source_descEMS                      0.078571   0.033591   2.339  0.01934 *  
## al_source_descEMS-911                  0.090889   0.033931   2.679  0.00740 ** 
## al_source_descCLASS-3                 -0.269415   0.017049 -15.803  < 2e-16 ***
## al_source_descOthers                   0.073352   0.035708   2.054  0.03996 *  
## al_index_descInitial Alarm             1.639968   0.022318  73.483  < 2e-16 ***
## al_index_descOthers                    3.124788   0.106585  29.317  < 2e-16 ***
## inc_class_groupMedical MFAs            0.158559   0.076880   2.062  0.03918 *  
## inc_class_groupNonMedical Emergencies -0.369629   0.033208 -11.131  < 2e-16 ***
## inc_class_groupNonMedical MFAs        -0.051299   0.048536  -1.057  0.29056    
## inc_class_groupNonStructural Fires    -0.046055   0.050892  -0.905  0.36550    
## inc_class_groupStructural Fires       -0.324124   0.042324  -7.658 1.96e-14 ***
## inc_resp_min_qy                       -0.008846   0.001992  -4.441 8.98e-06 ***
## engines_assigned                       0.017077   0.008633   1.978  0.04794 *  
## ladders_assigned                       0.075943   0.013537   5.610 2.04e-08 ***
## others_units_assigned                  0.064331   0.008521   7.550 4.52e-14 ***
## day_typeWeekend                        0.001355   0.011469   0.118  0.90598    
## working_hourTRUE                      -0.087453   0.030823  -2.837  0.00455 ** 
## time_of_dayMorning                     0.192540   0.031969   6.023 1.74e-09 ***
## time_of_dayAfternoon                   0.166589   0.034707   4.800 1.60e-06 ***
## time_of_dayEvening                    -0.001395   0.017024  -0.082  0.93469    
## tua_is_oneTRUE                        -0.155973   0.018907  -8.250  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7975 on 24554 degrees of freedom
## Multiple R-squared:  0.2997, Adjusted R-squared:  0.299 
## F-statistic: 420.4 on 25 and 24554 DF,  p-value: < 2.2e-16

Remember again that we can’t compare the \(R^2\) between the former and the latter model since the response is not on the same scale.

View the residuals.

par(mfrow=c(2,2))
plot(lm_em_full_log)

  1. The Residuals VS Fitted seems pretty good since the red line follow reasonably well the 0 horizontal dotted line, indicating our residual have a linear relation with the fitted values
  2. The Q-Q Plots again tells us that we are not in a situation of normally distributed residuals since we have the two tail that goes outside the 95% confidence interval.
  3. In the Scale - Location Plots homoscedasticity is quite satisfied even if we have a slight decreasing trend and we have a peek on the middle.
  4. In the Residuals VS Leverage Plots we can see that there are not any influential point that we have to process and investigate.

Intrestingly again we note the presence of three clusters on the Residuals VS Fitted and Scale - Location plot, let’s investigate a bit in order to gain some additional information.

ggplot(lm_em_full_log, aes(x = .fitted, y = .resid)) +
  geom_point(aes(color = inc_class_group)) +
  geom_hline(yintercept = 0) +
  labs(title = "Residuals VS Fitted",
       x = "Fitted Values", y = "Residuals", color = "Incident Class Group")

The right cluster is for the Structural Fires, the middle one contain both Medical and NonMedical Emergencies with some Strictural and NonStructural Fires, and the left one contains NonMedical Emergencies and a small number of Medical one and Medical and NonMedical MFAs.

qqPlot(residuals(lm_em_full_log))

## [1]   733 19749

Here the right tail is less far from the 95 confidence interval respect the previous model, but on the other hand the left tail is heavily skewed to the bottom of the interval. Indicating that again we do not reach the normal distribution of the residual.

In conclusion we end up in a situation where the linearity assumptions are not meet thus we can’t use a regression model to perform prediction even with the logarithm transformation of the response. It is much likely that a powerful methods should be taken into account for this analysis with a deeper study of the relationship between predictors.

Moreover we think also that a deeper consideration of the outliers should be taken into account since are those observations that bring with them many informations.

8 Cast the anaysis to a Calssification task

As mention before we decided like the professor suggested to us, to cast our regression problem in a classification problem by dividing the range of possible time difference response into 2 for a binary classification or more than 2 for a multi-classification task.

We will use the same predictors of the Regression Section so first inc_resp_min_qy and then emergency_min_qy.

So first thing first we have to decide the range of value for both of them.

summary(fire_data_new$inc_resp_min_qy)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.350   4.383   5.467   5.949   6.850  58.917
summary(fire_data_new$emergency_min_qy)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.05    7.15   11.93   17.08   19.58  596.38

We start by considering a classical binary classification in which the threshold used as response is the mean of the respective responses, thus:

# threshold for inc_resp_min_qy
th_irm <- summary(fire_data_new$inc_resp_min_qy)[4]

# threshold for emergency_min_qy
th_eme <- summary(fire_data_new$emergency_min_qy)[4]

8.1 Use the range of inc_resp_min_qy as response

# make a copy of the train and test
cl_resp_min_fd.train <- fire_data.train
cl_resp_min_fd.test <- fire_data.test

cl_resp_min_fd.train$fast_response <- cl_resp_min_fd.train$inc_resp_min_qy < th_irm
cl_resp_min_fd.test$fast_response <- cl_resp_min_fd.test$inc_resp_min_qy < th_irm

# remove the future time differences
cl_resp_min_fd.train <- cl_resp_min_fd.train %>% select(-c(emergency_min_qy, inc_resp_min_qy)) #ticket_time, disp_resp_min_qy, inc_travel_min_qy,
cl_resp_min_fd.test <- cl_resp_min_fd.test %>% select(-c(emergency_min_qy, inc_resp_min_qy))

8.1.1 GLM

Fit our full GLM using the family binomial.

res_glm.fit_full <- glm(fast_response ~ ., data = cl_resp_min_fd.train, family = binomial)
summary(res_glm.fit_full)
## 
## Call:
## glm(formula = fast_response ~ ., family = binomial, data = cl_resp_min_fd.train)
## 
## Coefficients:
##                                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                           -0.440636   0.191213  -2.304 0.021199 *  
## inc_boroughBrooklyn                    0.793697   0.096371   8.236  < 2e-16 ***
## inc_boroughManhattan                   0.253058   0.081683   3.098 0.001948 ** 
## inc_boroughQueens                      0.001828   0.087868   0.021 0.983404    
## inc_boroughStaten Island               0.769430   0.157486   4.886 1.03e-06 ***
## cong_dist5                             0.563403   0.120135   4.690 2.74e-06 ***
## cong_dist6                            -0.007701   0.125270  -0.061 0.950981    
## cong_dist7                             0.255433   0.129206   1.977 0.048048 *  
## cong_dist8                             0.123049   0.145485   0.846 0.397672    
## cong_dist9                             0.012119   0.148008   0.082 0.934741    
## cong_dist10                            0.263689   0.144203   1.829 0.067460 .  
## cong_dist11                           -0.078851   0.174495  -0.452 0.651354    
## cong_dist12                           -0.613132   0.152444  -4.022 5.77e-05 ***
## cong_dist13                            0.016054   0.145724   0.110 0.912276    
## cong_dist14                           -0.334900   0.133211  -2.514 0.011935 *  
## cong_dist15                           -0.066783   0.145563  -0.459 0.646385    
## cong_dist16                           -0.343375   0.202869  -1.693 0.090533 .  
## al_source_descEMS                      0.327563   0.094391   3.470 0.000520 ***
## al_source_descEMS-911                  0.338690   0.095408   3.550 0.000385 ***
## al_source_descCLASS-3                 -0.208798   0.049861  -4.188 2.82e-05 ***
## al_source_descOthers                   0.587118   0.108623   5.405 6.48e-08 ***
## al_index_descInitial Alarm             0.105760   0.058726   1.801 0.071717 .  
## al_index_descOthers                   -1.550582   1.125302  -1.378 0.168226    
## highest_al_levelNonFirst Alarm         1.174835   1.221176   0.962 0.336024    
## inc_class_groupMedical MFAs            0.175481   0.206894   0.848 0.396345    
## inc_class_groupNonMedical Emergencies -0.141491   0.092346  -1.532 0.125477    
## inc_class_groupNonMedical MFAs         0.109656   0.136285   0.805 0.421046    
## inc_class_groupNonStructural Fires     0.476070   0.153331   3.105 0.001904 ** 
## inc_class_groupStructural Fires        0.329006   0.127164   2.587 0.009674 ** 
## engines_assigned                       0.350136   0.025865  13.537  < 2e-16 ***
## ladders_assigned                      -0.047956   0.039152  -1.225 0.220633    
## others_units_assigned                  0.216603   0.033077   6.548 5.81e-11 ***
## day_typeWeekend                        0.116748   0.031931   3.656 0.000256 ***
## working_hourTRUE                       0.382974   0.082846   4.623 3.79e-06 ***
## time_of_dayMorning                     0.048857   0.085840   0.569 0.569240    
## time_of_dayAfternoon                   0.165768   0.093542   1.772 0.076374 .  
## time_of_dayEvening                     0.787088   0.047067  16.723  < 2e-16 ***
## tua_is_oneTRUE                        -0.890724   0.055718 -15.986  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 32991  on 24579  degrees of freedom
## Residual deviance: 29270  on 24542  degrees of freedom
## AIC: 29346
## 
## Number of Fisher Scoring iterations: 5

We have an \(AIC = 29346\), so in order to use this measure we have to compare this model with an other one.

Update the previous model by adding an interaction terms and removing the non-significant predictors.

res_glm.fit_full_upd <- update(res_glm.fit_full, . ~ . - highest_al_level - ladders_assigned + log(others_units_assigned + 1) + log(engines_assigned + 1) : al_source_desc, data = cl_resp_min_fd.train)
summary(res_glm.fit_full_upd)
## 
## Call:
## glm(formula = fast_response ~ inc_borough + cong_dist + al_source_desc + 
##     al_index_desc + inc_class_group + engines_assigned + others_units_assigned + 
##     day_type + working_hour + time_of_day + tua_is_one + log(others_units_assigned + 
##     1) + al_source_desc:log(engines_assigned + 1), family = binomial, 
##     data = cl_resp_min_fd.train)
## 
## Coefficients:
##                                                  Estimate Std. Error z value
## (Intercept)                                     -1.293451   0.199881  -6.471
## inc_boroughBrooklyn                              0.825966   0.096958   8.519
## inc_boroughManhattan                             0.260902   0.082309   3.170
## inc_boroughQueens                                0.032433   0.088326   0.367
## inc_boroughStaten Island                         0.804907   0.158187   5.088
## cong_dist5                                       0.542691   0.120257   4.513
## cong_dist6                                      -0.006665   0.125399  -0.053
## cong_dist7                                       0.268477   0.129342   2.076
## cong_dist8                                       0.160544   0.145804   1.101
## cong_dist9                                       0.036823   0.148321   0.248
## cong_dist10                                      0.266674   0.144500   1.845
## cong_dist11                                     -0.078938   0.174870  -0.451
## cong_dist12                                     -0.568958   0.152845  -3.722
## cong_dist13                                      0.077464   0.146147   0.530
## cong_dist14                                     -0.333350   0.133325  -2.500
## cong_dist15                                     -0.024201   0.145927  -0.166
## cong_dist16                                     -0.311841   0.203628  -1.531
## al_source_descEMS                                1.518630   0.188914   8.039
## al_source_descEMS-911                            1.817569   0.217926   8.340
## al_source_descCLASS-3                            0.615861   0.125225   4.918
## al_source_descOthers                             0.364355   0.170183   2.141
## al_index_descInitial Alarm                       0.118721   0.058894   2.016
## al_index_descOthers                              0.428245   0.556270   0.770
## inc_class_groupMedical MFAs                      0.218075   0.205951   1.059
## inc_class_groupNonMedical Emergencies            0.130132   0.092902   1.401
## inc_class_groupNonMedical MFAs                   0.370978   0.137041   2.707
## inc_class_groupNonStructural Fires               0.577434   0.152818   3.779
## inc_class_groupStructural Fires                  0.541060   0.128237   4.219
## engines_assigned                                 0.002155   0.088429   0.024
## others_units_assigned                           -0.049217   0.061872  -0.795
## day_typeWeekend                                  0.123105   0.032125   3.832
## working_hourTRUE                                 0.377713   0.083195   4.540
## time_of_dayMorning                               0.047681   0.086211   0.553
## time_of_dayAfternoon                             0.169005   0.093953   1.799
## time_of_dayEvening                               0.789612   0.047341  16.679
## tua_is_oneTRUE                                  -0.532112   0.061669  -8.629
## log(others_units_assigned + 1)                   0.467684   0.136002   3.439
## al_source_descPHONE:log(engines_assigned + 1)    1.166921   0.186483   6.258
## al_source_descEMS:log(engines_assigned + 1)     -0.616517   0.278018  -2.218
## al_source_descEMS-911:log(engines_assigned + 1) -1.025804   0.319131  -3.214
## al_source_descCLASS-3:log(engines_assigned + 1)  0.132610   0.253631   0.523
## al_source_descOthers:log(engines_assigned + 1)   1.800903   0.317961   5.664
##                                                 Pr(>|z|)    
## (Intercept)                                     9.73e-11 ***
## inc_boroughBrooklyn                              < 2e-16 ***
## inc_boroughManhattan                            0.001525 ** 
## inc_boroughQueens                               0.713474    
## inc_boroughStaten Island                        3.61e-07 ***
## cong_dist5                                      6.40e-06 ***
## cong_dist6                                      0.957615    
## cong_dist7                                      0.037921 *  
## cong_dist8                                      0.270854    
## cong_dist9                                      0.803926    
## cong_dist10                                     0.064967 .  
## cong_dist11                                     0.651696    
## cong_dist12                                     0.000197 ***
## cong_dist13                                     0.596083    
## cong_dist14                                     0.012410 *  
## cong_dist15                                     0.868279    
## cong_dist16                                     0.125664    
## al_source_descEMS                               9.08e-16 ***
## al_source_descEMS-911                            < 2e-16 ***
## al_source_descCLASS-3                           8.74e-07 ***
## al_source_descOthers                            0.032278 *  
## al_index_descInitial Alarm                      0.043817 *  
## al_index_descOthers                             0.441389    
## inc_class_groupMedical MFAs                     0.289660    
## inc_class_groupNonMedical Emergencies           0.161293    
## inc_class_groupNonMedical MFAs                  0.006788 ** 
## inc_class_groupNonStructural Fires              0.000158 ***
## inc_class_groupStructural Fires                 2.45e-05 ***
## engines_assigned                                0.980561    
## others_units_assigned                           0.426346    
## day_typeWeekend                                 0.000127 ***
## working_hourTRUE                                5.62e-06 ***
## time_of_dayMorning                              0.580213    
## time_of_dayAfternoon                            0.072046 .  
## time_of_dayEvening                               < 2e-16 ***
## tua_is_oneTRUE                                   < 2e-16 ***
## log(others_units_assigned + 1)                  0.000584 ***
## al_source_descPHONE:log(engines_assigned + 1)   3.91e-10 ***
## al_source_descEMS:log(engines_assigned + 1)     0.026586 *  
## al_source_descEMS-911:log(engines_assigned + 1) 0.001307 ** 
## al_source_descCLASS-3:log(engines_assigned + 1) 0.601082    
## al_source_descOthers:log(engines_assigned + 1)  1.48e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 32991  on 24579  degrees of freedom
## Residual deviance: 29028  on 24538  degrees of freedom
## AIC: 29112
## 
## Number of Fisher Scoring iterations: 5
AIC(res_glm.fit_full, res_glm.fit_full_upd)
##                      df      AIC
## res_glm.fit_full     38 29346.47
## res_glm.fit_full_upd 42 29111.79

We see a substantial decrease of the AIC for the updated model, thus we will use that one to make our predictions.

res_glm.probs <- predict(res_glm.fit_full_upd, newdata = cl_resp_min_fd.test, type = "response")

The agreement between predictions and observed survival data is conveniently summarized with a confusion matrix. Below we assign a fast response incident if the estimated probability of being fast is larger than 0.5:

res_preds50 <- predict(res_glm.fit_full_upd, newdata = cl_resp_min_fd.test, type = "response") > 0.5
table(preds = res_preds50, true = cl_resp_min_fd.test$fast_response)
##        true
## preds   FALSE TRUE
##   FALSE  1188  696
##   TRUE   1313 2949

The accuracy of the logistic regression classifier with the 50% threshold is:

mean(res_preds50 == cl_resp_min_fd.test$fast_response)
## [1] 0.6731207

Sensitivity and specificity are 2949 / 3645 = 0.81 and 1188 / 2501 = 0.48, respectively.

The ROC curve can be computed with package pROC:

res_glm.roc <- roc(cl_resp_min_fd.test$fast_response ~ res_glm.probs, plot = TRUE, print.auc = TRUE)

The AUC for this logistic regression is 0.73. Now we use the function coords of package pROC to extract the coordinates of the ROC at the best point, which corresponding to the maximum of the sum of sensitivity and specificity (see the online help of coords for more details

coords(res_glm.roc, x = "best", ret = "all")
##           threshold specificity sensitivity accuracy   tn   tp   fn  fp
## threshold 0.6278798     0.72491    0.630727 0.669053 1813 2299 1346 688
##                 npv       ppv       fdr     fpr      tpr     tnr      fnr
## threshold 0.5739158 0.7696686 0.2303314 0.27509 0.630727 0.72491 0.369273
##           1-specificity 1-sensitivity 1-accuracy     1-npv     1-ppv precision
## threshold       0.27509      0.369273   0.330947 0.4260842 0.2303314 0.7696686
##             recall   youden closest.topleft
## threshold 0.630727 1.355637        0.212037

Moreover according to the output of coords, the optimal choice corresponds to a threshold of 0.63 with a corresponding accuracy of:

res_acc_glm <- coords(res_glm.roc, x = "best", ret = "all")$accuracy
res_acc_glm
## [1] 0.669053